Implement a Partition Strategy for Analytical Workloads – The Storage of Data


When you begin to plan the storage of data for an analytical workload, terms such as hybrid transaction/analytical processing (HTAP) and online analytical processing (OLAP) might come to mind. Both of those data processing types are useful for analytical workloads. HTAP is a hybrid mechanism that combines OLAP and online transactional processing (OLTP). That means your data storage resource can efficiently respond to both real‐time transactions and data analytics processes, where transactions are inserts, updates, and deletes, and data analytics consists of selects and reporting. Chapter 6, “Create and Manage Batch Processing and Pipelines,” discusses how to implement an HTAP solution in Azure Synapse Analytics using a feature called change data capture (CDC). CDC captures changes from the transaction log and transfers them in near real time so that data analytics can be run against them. This behavior is similar to the serving layer in the lambda architecture, which is fed by the speed layer, meaning that you can analyze your transactional data and gather insights from it in near real time.
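To make this concrete, the following is a minimal PySpark sketch of the consuming side of such a near‐real‐time pattern. It assumes a hypothetical ADLS landing folder where a CDC process writes change files as JSON; the path, schema, and column names are illustrative placeholders, not values from the book's exercises.

```python
# Minimal sketch: querying CDC output in near real time with Spark
# Structured Streaming. The storage path, schema, and column names
# are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import (IntegerType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("cdc-speed-layer").getOrCreate()

# Schema the CDC process is assumed to produce for each change record.
change_schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("operation", StringType()),     # e.g., insert/update/delete
    StructField("amount", IntegerType()),
    StructField("changed_at", TimestampType()),
])

# Read the landing folder as a stream; each newly arriving file is
# picked up automatically, approximating the speed layer feeding the
# serving layer in a lambda architecture.
changes = (
    spark.readStream
         .schema(change_schema)
         .json("abfss://data@account.dfs.core.windows.net/cdc/orders/")
)

# A simple near-real-time aggregate over the change feed.
inserts = changes.filter(col("operation") == "insert")
query = (
    inserts.groupBy("order_id").count()
           .writeStream
           .outputMode("complete")
           .format("memory")      # in-memory sink for interactive inspection
           .queryName("order_insert_counts")
           .start()
)
```

The memory sink here is only for interactive inspection; a production serving layer would more likely write to a delta table or a dedicated SQL pool.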

Some other important elements to consider when creating a partition strategy for analytical workloads are the stack, the datastore, the optimal balance between file size and the number of files, table distribution, and the law of 60. Determining whether your analytical workloads require a Spark cluster, a serverless SQL pool, or a dedicated SQL pool is an important early decision. For example, if you want to (or must) use delta tables, then you will need a Spark cluster (see the sketch following this paragraph). The datastore is determined by the format of your data, for example, relational, semi‐structured, or nonstructured, and different products are suited to optimizing the storage of each format. In Chapter 10, “Optimize and Troubleshoot Data Storage Processing,” you perform an exercise in which you optimize the ratio between file size and the number of files. For example, if your queries must load a large number of small files, you need to consider the latency of the I/O transaction required to load each one. As discussed in Chapter 10, there is an optimal balance between the number of files and the size of files, and finding that balance is central to partitioning analytical workload data well. Chapter 10 also discusses table distribution (round‐robin, hash, and replicated), which you implemented in Exercise 4.4. Finally, the law of 60, which stems from the fact that a dedicated SQL pool spreads data across 60 distributions, was first introduced in Chapter 2, is discussed in numerous other chapters, including extensively in Chapter 10, and is implemented in Exercise 5.6.
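The following is a minimal sketch of those partitioning decisions in practice on a Spark pool: it writes a delta table partitioned by a date‐derived column and uses repartitioning to keep the file‐size/file‐count balance in check. The paths and column names are hypothetical, and Delta Lake support is assumed to be available on the cluster, as it is on Azure Synapse Spark pools.

```python
# Minimal sketch: writing a partitioned delta table and controlling the
# balance between file size and the number of files. Paths and column
# names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year

spark = SparkSession.builder.appName("partition-strategy").getOrCreate()

readings = spark.read.parquet(
    "abfss://data@account.dfs.core.windows.net/raw/readings/"
)

# Partition by a column commonly used in query predicates so that
# queries prune entire folders instead of scanning every file.
readings = readings.withColumn("reading_year", year(col("reading_date")))

(
    readings
    # Co-locate rows for each partition value before writing: fewer,
    # larger files per partition generally beat many small ones, because
    # each file read carries its own I/O latency overhead.
    .repartition("reading_year")
    .write
    .format("delta")
    .mode("overwrite")
    .partitionBy("reading_year")
    .save("abfss://data@account.dfs.core.windows.net/curated/readings/")
)
```

On the dedicated SQL pool side, the analogous decision is the table's distribution option (round‐robin, hash, or replicated), and because every table is spread across 60 distributions, each table partition you define multiplies by 60, which is why the law of 60 matters when sizing partitions.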
