Implement a Partition Strategy – The Storage of Data
Chapter 3 discussed designing partition strategies. Remember the two primary reasons for partitioning data. The first reason is to improve the speed at which a query returns its dataset. Partitioned data is organized into smaller groups bound together by some common attribute. For example, if the data is partitioned by the scenario in which a brain wave was captured, a query for PlayingGuitar readings returns more quickly when it reads only the PlayingGuitar partition than when it must scan a single partition that contains all scenarios.
The second reason has to do with a combination of size and retention. Executing a query on a partition with 1 billion rows takes longer than running the same query on a partition with 1 million rows. Consider partitioning a table that contains 1 billion rows by a datetime stamp. In this hypothetical situation, partitioning on the datetime stamp would reduce the queryable data considerably, thus reducing the amount of data each query must read. This partitioning approach also enables the deletion of data once it is no longer relevant. For example, once a partition's datetime stamp is more than 30 days old, that partition can be removed or archived, again reducing the size and amount of data being queried.
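A minimal PySpark sketch of this retention pattern follows. The paths, the READING_DATE partition column, the reuse of the SESSION_DATETIME column from the later snippets, and the 30-day cutoff are all assumptions for illustration rather than part of the exercises.
%%pyspark
from datetime import datetime, timedelta
from pyspark.sql.functions import col, to_date

# Hypothetical paths and column names, used only to illustrate the idea.
df = spark.read.parquet('abfss://*@*.dfs.core.windows.net/readings')

# Partition on the date portion of the datetime stamp so that each day
# lands in its own folder, for example READING_DATE=2022-04-10.
df.withColumn('READING_DATE', to_date(col('SESSION_DATETIME'))) \
  .write.partitionBy('READING_DATE').mode('overwrite') \
  .parquet('abfss://*@*.dfs.core.windows.net/readings-by-date')

# A query that filters on READING_DATE reads only the matching folders.
cutoff = (datetime.utcnow() - timedelta(days=30)).date()
recent = spark.read \
  .parquet('abfss://*@*.dfs.core.windows.net/readings-by-date') \
  .where(col('READING_DATE') >= str(cutoff))
print(recent.count())

# Folders older than the cutoff can then be deleted or moved to an
# archive tier, shrinking the data that remains queryable.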
The following sections provide some additional insights into data partitioning for specific scenarios. In most cases, the implementation of these scenarios comes in later chapters and exercises, as noted.
Implement a Partition Strategy for Files
You implemented a partition strategy for a CSV file and a Parquet file in Exercise 4.2 and Exercise 4.12, respectively. In Exercise 4.2 you partitioned a CSV file by scenario, as shown in the following code snippet:
%%pyspark
df = spark.read \
    .load('abfss://*@*.dfs.core.windows.net/SessionCSV/…ODE_FREQUENCY_VALUE.csv',
          format='csv', header=True)
df.write \
    .partitionBy('SCENARIO').mode('overwrite').csv('/SessionCSV/ScenarioPartitions')
data = spark.read.csv('/SessionCSV/ScenarioPartitions/SCENARIO=PlayingGuitar')
The first line loaded the CSV file containing brain wave readings taken in different scenarios into a DataFrame. The second line of code used the partitionBy() method to separate the CSV file into partitions based on the scenarios contained in the file. The third line of code loaded the data that exists in the PlayingGuitar partition. If the original CSV file containing all scenarios had 1 billion rows, a query against it would clearly take longer than loading only the rows for a specific scenario from its own partition.
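As a point of reference, partitionBy('SCENARIO') writes one subfolder per distinct scenario value beneath the output path, named in the form SCENARIO=<value>. The short sketch below, which assumes the mssparkutils file system utilities available in Azure Synapse Spark notebooks, simply lists those partition folders.
%%pyspark
# The write above produces one folder per scenario, for example:
#   /SessionCSV/ScenarioPartitions/SCENARIO=PlayingGuitar/part-...csv
# Listing the output path shows the partition folders that can be
# loaded individually, as was done with SCENARIO=PlayingGuitar.
for item in mssparkutils.fs.ls('/SessionCSV/ScenarioPartitions'):
    print(item.name)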
In Exercise 4.12, you used the year and month as the basis for your file partitioning. The following code snippet is an example of how a Parquet file was partitioned by year and month:
%%pyspark
from pyspark.sql.functions import col, month, year

df = spark.read \
    .load('abfss://*@*.dfs.core.windows.net/out-path/file.parquet',
          format='parquet', header=True)
df_year_month_day = df \
    .withColumn("year", year(col("SESSION_DATETIME"))) \
    .withColumn("month", month(col("SESSION_DATETIME")))
df_year_month_day.write \
    .partitionBy("year", "month").mode("overwrite") \
    .parquet('abfss://*@*.dfs.core.windows.net/EMEA/brainjammer/cleansed-data/2022/04/10')
df = spark.read \
    .load('abfss://*@*.dfs.core.windows.net/path/year=2021/month=7',
          format='parquet', header=True)
print(df.count())
display(df.limit(10))
The data is first loaded into a DataFrame and then copied into a new DataFrame with the addition of year and month columns derived from SESSION_DATETIME. The data is then written into partitions using the partitionBy() method and retrieved by appending the partition folder names to the path passed to the load() method:
year=2021/month=7
The rows are then counted with the count() method, and the first 10 rows are rendered to the output using the display() method.
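An alternative to hard-coding the partition folders into the path, sketched here with the same wildcard placeholders used above, is to load the parent folder and filter on the partition columns; Spark then prunes the folders that do not match, reading only year=2021/month=7.
%%pyspark
from pyspark.sql.functions import col

df = spark.read \
    .load('abfss://*@*.dfs.core.windows.net/path', format='parquet')

# Filtering on the partition columns triggers partition pruning, so only
# the year=2021/month=7 folder is scanned.
july_2021 = df.where((col('year') == 2021) & (col('month') == 7))
print(july_2021.count())
display(july_2021.limit(10))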