Azure Synapse Analytics Data Hub Data Flow – The Storage of Data
1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ on the Overview blade, click the Open link in the Open Synapse Studio tile ➢ select the Manage hub ➢ click SQL Pools ➢ make sure your dedicated SQL pool is running ➢ select the Data hub ➢ expand the SQL Database ➢ expand the dedicated SQL pool ➢ hover over Tables ➢ click the ellipsis (…) ➢ select New SQL Script ➢ select New Table ➢ and then enter the following SQL command (assuming the table already exists from a previous exercise):
DROP TABLE brainwaves.DimELECTRODE
2. Create an SCD table, and then execute the following SQL script, which is located in the folder Chapter04/Ch04Ex09 on GitHub at https://github.com/benperk/ADE and named createSlowlyChangingDimensionTable.sql:
CREATE TABLE [brainwaves].[DimELECTRODE]
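Only the first line of the CREATE TABLE script is shown here. A minimal sketch of what the full definition might look like follows; the column names and data types beyond the four SCD columns discussed later in this section are assumptions, so treat the file on GitHub as the authoritative version:

CREATE TABLE [brainwaves].[DimELECTRODE]
(
    [SK_ID]         VARCHAR (10)   NOT NULL,  -- surrogate key, e.g., 'E1'
    [ELECTRODE_ID]  INT            NOT NULL,  -- business key (assumed)
    [ELECTRODE]     NVARCHAR (50)  NOT NULL,  -- attribute columns (assumed)
    [LOCATION]      NVARCHAR (50)  NOT NULL,
    [START_DATE]    DATETIME2      NOT NULL,  -- when this version became current
    [END_DATE]      DATETIME2      NOT NULL,  -- sentinel 'no date' until superseded
    [IS_CURRENT]    INT            NOT NULL   -- 1 = current version of the row
)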
3. Populate the SCD table with data, and then execute the INSERT statements from the file insertSlowlyChangingDimensionTable.sql, which is in the folder Chapter04/Ch04Ex09 on GitHub:
DECLARE @SK_ID AS VARCHAR(10)
DECLARE @START_DATE AS DATETIME2 = GETDATE()
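These two DECLARE statements are only the beginning of the script. Based on the description that follows step 5, a hedged reconstruction of one complete INSERT might look like the following; the column list and sample values are illustrative rather than copied from insertSlowlyChangingDimensionTable.sql:

DECLARE @SK_ID AS VARCHAR(10)
DECLARE @START_DATE AS DATETIME2 = GETDATE()
-- Build the surrogate key: total row count + 1, prefixed with an E
SELECT @SK_ID = 'E' + CAST(COUNT(*) + 1 AS VARCHAR(9))
FROM [brainwaves].[DimELECTRODE]
-- END_DATE receives a sentinel 'no date' value; IS_CURRENT marks the row current
INSERT INTO [brainwaves].[DimELECTRODE]
       (SK_ID, ELECTRODE_ID, ELECTRODE, LOCATION, START_DATE, END_DATE, IS_CURRENT)
VALUES (@SK_ID, 1, 'AF3', 'Front Left', @START_DATE, '9999-12-31', 1)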
4. Select the inserted data using the following SQL command:
SELECT * FROM brainwaves.DimELECTRODE
5. Wait some time ➢ perform an UPDATE by executing the SQL code in the file named updateSlowlyChangingDimensionTable.sql (available in the folder Chapter04/Ch04Ex09 on GitHub) ➢ and then execute another SELECT statement.
The query that creates the table looks like one for any other dimension table, albeit with some additional columns. Those columns are a surrogate key, a start date, an end date, and a flag that identifies whether the row is the current one. These columns are required for which SCD type? The INSERT statements contain some syntax that leans a bit toward procedural coding and away from vanilla SQL. The INSERT statements require a surrogate key, which is identified by the variable @SK_ID. The value for this variable is captured by selecting the total number of rows in the table, increasing it by 1, and prefixing it with an E. See step 4 in the previous exercise if this is not clear. That variable is then used as part of the INSERT statement. The other variable is the current date, which is captured by executing the GETDATE() function and storing the result in @START_DATE. The END_DATE is set to a value that is recognized as no date, and IS_CURRENT is set to 1, which means the row is the current one.
The UPDATE logic is a bit more complicated. The code first declares two variables, @SK_ID and @START_DATE, and then checks the table being updated to see whether the row already exists. If it does, an UPDATE is necessary; if it does not, the code performs an INSERT instead. This combination is what many refer to as an UPSERT. When an UPDATE occurs, there is also an INSERT, which adds a new row to the table. The new row receives the newly computed @SK_ID and the same business key as the row being updated, along with the new values: the new @START_DATE and IS_CURRENT set to 1. The UPDATE closes out the old row by setting its END_DATE column to the @START_DATE value and its IS_CURRENT flag to 0. In case you have not already realized it, this is a Type 2 SCD table.
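A minimal sketch of that UPSERT flow follows, with the same caveat that the column names and values are illustrative rather than copied from updateSlowlyChangingDimensionTable.sql:

DECLARE @SK_ID AS VARCHAR(10)
DECLARE @START_DATE AS DATETIME2 = GETDATE()
SELECT @SK_ID = 'E' + CAST(COUNT(*) + 1 AS VARCHAR(9))
FROM [brainwaves].[DimELECTRODE]

-- If a current version of the row exists, close it out (Type 2 behavior)
IF EXISTS (SELECT 1 FROM [brainwaves].[DimELECTRODE]
           WHERE ELECTRODE_ID = 1 AND IS_CURRENT = 1)
BEGIN
    UPDATE [brainwaves].[DimELECTRODE]
    SET END_DATE = @START_DATE, IS_CURRENT = 0
    WHERE ELECTRODE_ID = 1 AND IS_CURRENT = 1
END

-- In either case, insert a new row carrying the new values
INSERT INTO [brainwaves].[DimELECTRODE]
       (SK_ID, ELECTRODE_ID, ELECTRODE, LOCATION, START_DATE, END_DATE, IS_CURRENT)
VALUES (@SK_ID, 1, 'AF3', 'Front Center', @START_DATE, '9999-12-31', 1)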
There are many approaches for implementing SCD tables. The approach in Exercise 4.10 is a manual one and is helpful for building a good understanding of how this works. Another approach could be something like the following.
- You have one staging table and one target table.
- Data is loaded into the staging table.
- Tables are joined to find matches.
- If the record exists in both the staging table and the target table, the record is updated on the target table.
- If the record exists only in the staging table and not in the target table, the record is inserted into the target table.
That scenario is the one in which you would employ the MERGE SQL command. (Refer to Chapter 3.) Both of these approaches are also fully supported with DataFrames and tables in a Spark pool or Azure Databricks.
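A hedged sketch of how that staging-to-target flow might look with MERGE follows. The staging table name brainwaves.DimELECTRODE_Stage is hypothetical, and this sketch implements only the simple matched/not-matched logic from the list above, not the Type 2 history tracking from Exercise 4.10:

MERGE [brainwaves].[DimELECTRODE] AS target
USING [brainwaves].[DimELECTRODE_Stage] AS staging
    ON target.ELECTRODE_ID = staging.ELECTRODE_ID
-- Record exists in both tables: update it on the target
WHEN MATCHED THEN
    UPDATE SET target.ELECTRODE = staging.ELECTRODE,
               target.LOCATION  = staging.LOCATION
-- Record exists only in the staging table: insert it into the target
WHEN NOT MATCHED BY TARGET THEN
    INSERT (SK_ID, ELECTRODE_ID, ELECTRODE, LOCATION, START_DATE, END_DATE, IS_CURRENT)
    VALUES (staging.SK_ID, staging.ELECTRODE_ID, staging.ELECTRODE, staging.LOCATION,
            GETDATE(), '9999-12-31', 1);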