Implement Data Archiving – The Storage of Data
Archiving data saves costs, improves performance, and may be necessary due to data privacy compliance. Cost savings are realized because you are charged by the amount of storage space required by your data. Removing the unnecessary data or changing the storage access tier from Hot to Archive for long‐term retention reduces cost. Performance is improved because the less data your queries are required to parse, the faster the result set is returned. Consider that a specific type of data in your data lake is no longer valid after a given age. If the date of validity is, for example, after 2022‐03‐31, then selecting that data and using it in your data analytics solution could have a negative impact on your conclusions. Instead of expecting all developers and consumers of the data to know to exclude this from their queries, you should consider archiving or purging this data from your data lake completely. From a data compliance perspective, not only is the location where the data is saved important, but so is the duration in which it is stored. Based on the data compliance requirements of the policy you are mandating, consider whether you are allowed to store the data over a year, month, day, or even at all. A common lifetime data retention amount is 7 years.
Archiving Data on Azure Synapse Analytics
A common and efficient technique for archiving data when it comes to a dedicated SQL pool running in your Azure Synapse Analytics workspace is by using a datetime stamp. A typical scenario is one where you do not want to remove the data completely but instead remove it from the primary locations where the retrieval and analysis are happening. You want to avoid your consumers and users executing queries that select all the data from tables, when much of the old data is no longer relevant. The performance of such queries would be considered latent. A method for archiving the data is to create another table that holds older data. What is considered old depends on your data and your requirements. The following snippet retrieves brain wave readings from the READING table that are over 5 years old and copies the rows into a table named READING_HISTORY, which resides on a schema named archive:
SELECT * INTO [archive].[READING_HISTORY]
FROM READING WHERE DATEDIFF(YEAR, READING_DATETIME, GetDate())> 5
Next, the following SQL statement removes the same data from the READING table:
DELETE FROM READING WHERE DATEDIFF(YEAR, READING_DATETIME, GetDate())> 5
Deleing the rows that are older than 5 years reduces the table’s size and would increase the performance of a query that retrieves all records from the READING table. If a user needs older brain waves, they can be retrieved from the READING_HISTORY table. I also recommend that you create a backup or snapshot prior to deleting a large amount of data. This can be achieved using the Restore feature for a given dedicated SQL pool and selecting the User‐defined Restore Points option. Finally, not all tables will have a datetime stamp. If this is the case, consider adding one as part of the pipeline transformation processing.
Archiving Files on ADLS
From an ADLS perspective there are a few options for data archiving, such as those described in Table 4.3. These actions are components of the built‐in lifecycle management policy feature of your Azure storage account.
TABLE 4.3 ADLS archiving actions
Action | Action name | Blob type | Snapshot |
Move to Cool tier | tierToCool | blockBlob | Yes |
Move to Archive tier | tierToArchive | blockBlob | Yes |
Delete | Delete | block/appendBlob | Yes |
Move to Hot from Cool | enableAutoTierToHotFromCool | blockBlob | No |
The most expensive access tier for a blob is Hot. Therefore, if the data in a given blob, which in this context is synonymous with a data file, is not accessed, then in order to save costs, you can move it to a less expensive tier. Complete Exercise 4.5, where you will implement data archiving for ADLS.