Implement a Serving Layer with a Star Schema – The Storage of Data-2
- Disable sampling in the data flow you created in step 4 ➢ commit the data flow to your source code repository ➢ navigate back to the pipeline ➢ commit the pipeline artifacts to your GitHub repository ➢ choose the Publish menu item ➢ review the pending changes ➢ click the OK button ➢ after the publishing has completed (be patient; view the status using the Notifications task bar item), select the Add Trigger menu drop‐down ➢ select Trigger Now ➢ click the OK button ➢ click the Monitor hub ➢ click the Pipeline Runs link in the Integration section to monitor the status ➢ once complete (be patient), rerun the SQL queries from step 8.
Congratulations on creating your first Azure Synapse Analytics pipeline. Reflect a bit on all the features it relied on and how much you have learned about them so far. You have come a long way, but there is still much more to do and learn, beginning with the creation of the tables in step 1. You should already understand the purpose of and differences between dimension tables and fact tables. In the SQL syntax that created them, each is prefixed with its table category. A few new concepts have not yet been thoroughly discussed: the ALTER statement, the PRIMARY KEY and UNIQUE constraints, and the integration/staging/temporary table. The ALTER command is used to make a change to a database object; in this case, a table. The change adds a primary key constraint, which is similar in principle to, but not the same as, a primary key in a relational database. Dedicated SQL pools do not support the concept of a foreign key, and therefore relationships between tables cannot be enforced. The following rules also apply to table constraints on a dedicated SQL pool; a brief syntax sketch follows the list.
• The UNIQUE constraint is supported only when NOT ENFORCED is used.
• A PRIMARY KEY is supported only when both NONCLUSTERED and NOT ENFORCED are used.
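To illustrate these rules, here is a minimal syntax sketch, not the book's exact statements. The dimension table name [brainwaves].[DimELECTRODE] and the constraint names are assumptions for illustration; [brainwaves].[TmpREADING] and its unique key columns come from the discussion that follows.

-- A PRIMARY KEY on a dedicated SQL pool must be NONCLUSTERED and NOT ENFORCED
ALTER TABLE [brainwaves].[DimELECTRODE]  -- hypothetical dimension table
ADD CONSTRAINT PK_DimELECTRODE
    PRIMARY KEY NONCLUSTERED (ELECTRODE_ID) NOT ENFORCED;

-- A UNIQUE constraint is supported only with NOT ENFORCED
ALTER TABLE [brainwaves].[TmpREADING]
ADD CONSTRAINT UQ_TmpREADING
    UNIQUE (READING_ID, SESSION_ID) NOT ENFORCED;

Because both constraints are NOT ENFORCED, the platform uses them as metadata (for example, to inform query plans) rather than checking each inserted row.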
Both a unique key and a primary key prevent duplicates and ensure that the values in the constrained columns are unique. The difference is that a unique key can contain null values, while a primary key cannot. A unique key therefore guarantees the uniqueness of non-null values only, whereas a primary key allows no nulls at all, so you can be certain there are no duplicate rows in the table. For example, consider the following data that might exist in the [brainwaves].[TmpREADING] table:
+------------+------------+--------------+--------------+-----+
| READING_ID | SESSION_ID | ELECTRODE_ID | FREQUENCY_ID | ... |
+------------+------------+--------------+--------------+-----+
| 1          | null       | 1            | 3            | ... |
| null       | null       | 2            | 1            | ... |
| null       | 1          | 3            | 2            | ... |
| null       | null       | 2            | 1            | ... |
| 2          | 2          | 4            | 5            | ... |
+------------+------------+--------------+--------------+-----+
The [brainwaves].[TmpREADING] table, as you might recall, has a unique key on READING_ID and SESSION_ID. That means that either of those columns, or both at the same time, can contain null values. There cannot be another 1 in the READING_ID column or another pair of 2s across both READING_ID and SESSION_ID. However, there can be numerous rows where both READING_ID and SESSION_ID are null. This behavior would not be allowed had READING_ID and SESSION_ID been a primary key. The foremost reason for choosing UNIQUE versus PRIMARY KEY is therefore whether you expect nulls in the columns that uniquely identify the rows of a table. Finally, the [brainwaves].[TmpREADING] table is a staging table, that is, a table used to temporarily store data so that it can be transformed later. This table would reside in the batch layer of the lambda architecture. Once the data is transformed and placed into the [brainwaves].[FactREADING] table, it would then be on the serving layer. In Chapter 5, "Transform, Manage, and Prepare Data," you will continue with this pipeline, transform the data, and then move it to a place along the DLZ stages where additional analysis can be performed.
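To make the staging-table concept concrete, here is a minimal sketch of how [brainwaves].[TmpREADING] might be created on a dedicated SQL pool. The column types and the ROUND_ROBIN/HEAP options are assumptions chosen because they are common for staging tables, not the book's exact definition.

CREATE TABLE [brainwaves].[TmpREADING]
(
    READING_ID   INT NULL,   -- nullable, covered by the unique key
    SESSION_ID   INT NULL,   -- nullable, covered by the unique key
    ELECTRODE_ID INT NOT NULL,
    FREQUENCY_ID INT NOT NULL
    -- additional reading columns elided
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

A round-robin heap is a typical staging choice: loads are fast because rows are spread evenly across distributions with no index maintenance, and the data is transformed and moved into [brainwaves].[FactREADING] before query performance matters.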
The movement of the data employs two different approaches. The first approach uses the COPY INTO SQL statement, which is not as performant as PolyBase but is much more flexible. The flexibility comes in the form of storing error files in a custom location, more supported datetime formats, additional wildcard search parameters, and automatic schema discovery. The second approach, which copies data from an Azure SQL database to the SQL pool, uses a data flow that employs PolyBase behind the scenes when you use an Azure Synapse Analytics source or sink. PolyBase is a very fast and efficient way to copy and move data of many formats from many source types to a different destination in a different format.
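As an illustration of the first approach, here is a hedged sketch of a COPY INTO statement. The storage account, container, and paths are placeholders, and the options shown (a wildcard file pattern and a custom error-file location) correspond to the flexibility described above.

COPY INTO [brainwaves].[TmpREADING]
FROM 'https://<account>.blob.core.windows.net/<container>/brainjammer/in/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2,                      -- skip the header row
    ERRORFILE = '/brainjammer/errors', -- custom location for rejected rows
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);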