Deliver Data in Parquet Files – The Storage of Data
In Exercise 4.7 you performed a conversion of brain waves stored in multiple CSV files using the following PySpark code snippet:
%%pyspark
df = spark.read.option(“header”,”true”) \
.csv(‘abfss://*@*.dfs.core.windows.net/EMEA/brainjammer/in/2022/04/01/18/*’)
display(df.limit(10))
Then you wrote that data, loaded in a DataFrame, to Parquet format using the following code syntax:
df.write.mode(“overwrite”) \
.parquet(‘/EMEA/brainjammer/out/2022/04/03/17/sampleBrainwaves.parquet’)
The data source does not have to be from another file; it can also come from a database table. By using Spark SQL, you can retrieve that data from the table and then write the data to a Parquet file, as follows:
df = spark.sql(‘SELECT * FROM FactREADING’)
You can perform calculations using aggregates or filter the data by using either Spark SQL or the methods available in the DataFrame.
df = spark.sql(“SELECT * FROM FactREADING WHERE FREQUENCY == ‘GAMMA'”)
dfSceanrio = df.where(“FREQUENCY == ‘GAMMA'”)
Delivering files in Parquet format means that you have placed your data into the most performant format. Querying the data source, parsing it, compressing it, and then moving it to a location for further processing through the next stages of Big Data are tasks you have already done and will do again.
Maintain Metadata
Metadata is information that exists for the purpose of explaining what the data is. Metadata is data about data; the better your metadata is, the more useful your data is. Files contain names, creation dates, and sizes, while databases have table relationships and schemas that represent metadata. Chapter 3 discussed metadata extensively, specifically Exercise 3.4 and Exercise 3.14, which walk you through the creation of an external Hive metastore.
The discipline known as master data management (MDM) focuses on the accuracy, accountability, and consistency of an enterprise’s data assets. This requires a skill set that itself is worthy of a career. Consider an example where you have a DBMS, or better yet a database, running on a dedicated SQL pool in Azure Synapse Analytics. The database has numerous tables, view, schemas, stored procedures, etc., with data flows, jobs, activities, and pipelines running at random intervals. If something stops working, where should you look? Perhaps you would check to see if a table has been modified or dropped. You would do this by looking at the database metadata. For example, the following SQL query can identify when a table was modified:
SELECT OBJECT_SCHEMA_NAME(object_id) schemaName, OBJECT_NAME(object_id)
tableName, modify_date
FROM sys.tables
This query will not identify whether a table has been dropped. There are some third‐party tools, however, that can take an inventory of your metadata. Capture and store a document that contains your current data inventory. Companies with a low data requirement have been known to add and update metadata into Excel manually. If there is ever a problem that might be related to the database structure, you can compare the historical data inventory to the current metadata. However, this approach, commonly called fire‐fighting mode, is not optimal for the enterprise. You need to have a formal process for managing and implementing changes that includes a step for updating your metadata‐tracking documentation and programs. In addition to Excel, there are some companies that specialize in this area, like Informatica and Collibra. Finally, having solid metadata is a must for implementing a solution with Azure Purview, which provides data discovery and governance support.
Implement a Dimensional Hierarchy
Refer to Figure 3.19 to see a visual representation of a dimensional hierarchy. Perform Exercise 4.14 to implement a dimensional hierarchy on an Azure Synapse Analytics dedicated SQL pool.