Category: Azure Synapse Analytics and ADLS

Implement Efficient File and Folder Structures – The Storage of Data

df = spark.read.load('abfss://*@*.dfs.core.windows.net/in-path/file.csv',
                     format='csv', header=True)

# convert the CSV source to Parquet
df.write.mode("overwrite") \
    .parquet('abfss://*@*.dfs.core.windows.net/out-path/file.parquet')

# read the converted file back and confirm the row count
df = spark.read.load('abfss://*@*.dfs.core.windows.net/out-path/file.parquet',
                     format='parquet', header=True)
print(df.count())

from pyspark.sql.functions import year, month, col

df = spark.read \
    .load('abfss://*@*.dfs.core.windows.net/out-path/file.parquet',
          format='parquet', header=True)

# add partition columns derived from the session timestamp
df_year_month_day = df \
    .withColumn("year", year(col("SESSION_DATETIME"))) \
    .withColumn("month", month(col("SESSION_DATETIME")))
from [...]
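The excerpt cuts off before the day column and the partitioned write. A minimal sketch of how those remaining steps typically look, assuming the day is derived with dayofmonth and using an illustrative output path:

from pyspark.sql.functions import dayofmonth

# assumed continuation: derive the day column, then write one folder per
# year/month/day combination (output path is illustrative)
df_year_month_day = df_year_month_day \
    .withColumn("day", dayofmonth(col("SESSION_DATETIME")))
df_year_month_day.write \
    .mode("overwrite") \
    .partitionBy("year", "month", "day") \
    .parquet('abfss://*@*.dfs.core.windows.net/out-path/partitioned/')

partitionBy produces a year=2022/month=4/day=1 folder hierarchy, so queries that filter on those columns read only the matching folders instead of scanning the whole data set.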

Azure Synapse Analytics Data Hub Data Flow – The Storage of Data

DROP TABLE brainwaves.DimELECTRODE

2. Create an SCD table, and then execute the following SQL script, which is located in the folder Chapter04/Ch04Ex09 on GitHub at https://github.com/benperk/ADE and named createSlowlyChangingDimensionTable.sql:

CREATE [...]
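The CREATE statement is truncated above; the full script is in the repository. To orient yourself, here is a minimal PySpark sketch of the Type 2 pattern such a dimension commonly supports. Every name in it (the staging table, ELECTRODE_ID, LOCATION, START_DATE, END_DATE, IS_CURRENT) is an assumption for illustration, not the script's actual schema:

from pyspark.sql import functions as F

# hypothetical inputs: the dimension as it stands and the latest source extract
dim = spark.table("brainwaves.DimELECTRODE")
src = spark.table("brainwaves.StagedElectrode")   # assumed staging table with the dimension's business columns

# IDs whose tracked attribute changed relative to the active dimension row
changed_ids = (dim.filter("IS_CURRENT = 1")
                  .join(src, "ELECTRODE_ID")
                  .filter(dim["LOCATION"] != src["LOCATION"])
                  .select("ELECTRODE_ID"))

today = F.current_date()

# Type 2: close out the superseded versions rather than overwriting them
current_changed = (dim.filter("IS_CURRENT = 1")
                      .join(changed_ids, "ELECTRODE_ID", "left_semi"))
expired = (current_changed
           .withColumn("END_DATE", today)
           .withColumn("IS_CURRENT", F.lit(0)))

# append the new attribute values as the current version
fresh = (src.join(changed_ids, "ELECTRODE_ID", "left_semi")
            .withColumn("START_DATE", today)
            .withColumn("END_DATE", F.lit(None).cast("date"))
            .withColumn("IS_CURRENT", F.lit(1)))

# everything else is carried over untouched, history included
kept = dim.exceptAll(current_changed)
dim_updated = kept.unionByName(expired).unionByName(fresh)

The key idea is that changed rows are never updated in place: the old version keeps its START_DATE/END_DATE window, and a new row is appended as the current one.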

Implement a Partition Strategy for Streaming Workloads – The Storage of Data

Chapter 7, “Design and Implement a Data Stream Processing Solution,” discusses partitioning data within one partition and across partitions. Exercise 7.5 features the hands‐on implementation of partitioning streaming workloads. Partitioning [...]
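Exercise 7.5 walks through the real implementation; as a minimal, self-contained illustration of partitioning a streaming workload, the following Structured Streaming sketch uses the built-in rate source in place of a live event stream and illustrative ADLS paths (both are assumptions):

from pyspark.sql.functions import col, year, month, dayofmonth

# built-in rate source stands in for a real event stream (Event Hubs, Kafka, ...)
stream = (spark.readStream
               .format("rate")
               .option("rowsPerSecond", 10)
               .load())

# derive partition columns from the event timestamp
by_date = (stream
           .withColumn("year", year(col("timestamp")))
           .withColumn("month", month(col("timestamp")))
           .withColumn("day", dayofmonth(col("timestamp"))))

# each micro-batch lands in a year=/month=/day= folder hierarchy
query = (by_date.writeStream
                .format("parquet")
                .option("path", "abfss://*@*.dfs.core.windows.net/stream-out/")          # placeholder path
                .option("checkpointLocation", "abfss://*@*.dfs.core.windows.net/chk/")   # placeholder path
                .partitionBy("year", "month", "day")
                .start())

The checkpoint location is what lets the query restart without reprocessing or duplicating events; the partition columns keep each micro-batch's files grouped by ingestion date.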

Deliver Data in Parquet Files – The Storage of Data

In Exercise 4.7 you converted brain wave data stored in multiple CSV files using the following PySpark code snippet:

%%pyspark
df = spark.read.option("header", "true") \
    .csv('abfss://*@*.dfs.core.windows.net/EMEA/brainjammer/in/2022/04/01/18/*')
display(df.limit(10))

Then you wrote that [...]
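The passage breaks off at the write step. Assuming the DataFrame was delivered as Parquet to a parallel out path (the path below is illustrative, not the exercise's actual target), the write typically looks like this:

# deliver the combined CSV input as Parquet (illustrative output path)
df.write \
  .mode("overwrite") \
  .parquet('abfss://*@*.dfs.core.windows.net/EMEA/brainjammer/out/2022/04/01/18/')

Because Parquet is columnar and stores its schema in the file footer, downstream reads no longer need the header option and scan only the columns they project.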