Implement a Partition Strategy for Streaming Workloads – The Storage of Data


Chapter 7, “Design and Implement a Data Stream Processing Solution,” discusses partitioning data within one partition and across partitions. Exercise 7.5 features the hands‐on implementation of partitioning streaming workloads. In the streaming context, partitioning concerns the allocation of datasets onto worker nodes, where a worker node is a virtual machine configured to process your data stream. Partitioning the streamed data improves execution time and efficiency because all data that shares a partition key is processed on the same machine. Processing all the like data on a single node removes the need to split datasets and then merge them back together after processing completes. How you achieve this will become clearer in Chapter 7, but the keyword for this grouping is windowing, which lets you group like data for a given time window and then process the data stream once that time window closes.
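As a rough illustration of windowed grouping, consider the following sketch in the Azure Stream Analytics query language. The input, output, and column names (brainwaves, powerBIoutput, READING_TYPE, and so on) are hypothetical placeholders, not names from the book's exercises:

```sql
-- Sketch only: group like readings by a key over a 5-second tumbling window.
-- All input/output/column names here are assumed for illustration.
SELECT
    READING_TYPE,
    AVG(READING_VALUE) AS AvgReading,
    System.Timestamp() AS WindowEnd
INTO powerBIoutput
FROM brainwaves TIMESTAMP BY ReadingTimestamp
GROUP BY READING_TYPE, TumblingWindow(second, 5)
```

Once each 5-second window closes, all rows sharing the same READING_TYPE value are aggregated together, which is the grouping behavior the paragraph above describes.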

Using partitions effectively in a streaming workload results in parallel processing of your data stream, which is achieved when multiple nodes process these partition‐based datasets concurrently. To achieve parallelism, both the input stream and the output stream must support partitions, and their partition counts must match. Not all streaming products support partitions explicitly (see Table 7.6). Some products create a partition key for you if one is not supplied, whereas others do not. Some products, like Power BI, do not support partitions at all, which is important to know because it means that if you plan on using Power BI as an output binding, you cannot achieve parallelism: the number of partitions configured for your input binding must equal the number of output partitions, and for Power BI there is always exactly one. See Figure 7.31 and Figure 7.37 for a visualization of parallelism using multiple partition keys.
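A minimal sketch of a query shape that can run in parallel across matching input and output partitions, again with hypothetical input, output, and column names:

```sql
-- Sketch only: PARTITION BY keeps each event hub partition's data on its own
-- processing path, so partitions are consumed and written concurrently.
-- eventHubInput, eventHubOutput, and READING_TYPE are assumed names.
SELECT READING_TYPE, COUNT(*) AS ReadingCount
INTO eventHubOutput
FROM eventHubInput PARTITION BY PartitionId
GROUP BY PartitionId, READING_TYPE, TumblingWindow(minute, 1)
```

Because the output here is an event hub that can be configured with the same partition count as the input, each partition flows through independently; with a single-partition output such as Power BI, this fan-out is not possible.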

Implement a Partition Strategy for Azure Synapse Analytics

Figure 2.10 showed the best example of partitioning in Azure Synapse Analytics. On a dedicated SQL pool, you can use a hash distribution key to group related data, which is then processed on multiple nodes in parallel, as in the following SQL snippet:

DISTRIBUTION = HASH([ELECTRODE_ID])
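In context, this option appears in the WITH clause of a dedicated SQL pool table definition. The following sketch shows where it fits; the table name and the columns other than ELECTRODE_ID are hypothetical:

```sql
-- Sketch of a hash-distributed table on a dedicated SQL pool.
-- Only the DISTRIBUTION clause mirrors the snippet above; the table
-- name and remaining columns are assumed for illustration.
CREATE TABLE [dbo].[READING]
(
    [READING_ID]   INT          NOT NULL,
    [ELECTRODE_ID] INT          NOT NULL,
    [VALUE]        DECIMAL(7,3) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH([ELECTRODE_ID]),
    CLUSTERED COLUMNSTORE INDEX
);
```

Rows with the same ELECTRODE_ID hash to the same distribution, so joins and aggregations on that column avoid data movement across nodes.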

You implemented partitioning in Exercise 3.7 and Exercise 4.4.
