Apache Spark Job Definition – The Storage of Data


The Apache Spark job definition feature enables you to execute code snippets using PySpark (Python), Spark (Scala) or .NET Spark (C#, F#).

The first text box requests the main definition file. The file type depends on the selected language: PySpark expects a PY file, .NET Spark expects either a DLL or an EXE file, and Scala expects a JAR file. The value placed into that text box is the location of the code file, for example, on an ADLS container. It may look something like this:

abfss://*@*.dfs.core.windows.net/EMEA/brainjammer/archiveData.py

The next text box, Command Line Arguments, gives you the option to pass arguments to the code. The values shown in Figure 4.29 are in, 2022, 04, and 01, which together identify the path to be archived. The following code example uses these arguments and archives the data. The full archiveData.py program file can be found on GitHub at https://github.com/benperk/ADE, in the Chapter04 directory.

import sys

# sc is the SparkContext created earlier in archiveData.py
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.set('fs.defaultFS', 'abfss://*@*.dfs.core.windows.net/EMEA/brainjammer/')
endpoint = hadoop_config.get('fs.defaultFS')
# Build the path to remove from the command-line arguments, e.g., in/2022/04/01
path = sys.argv[1] + "/" + sys.argv[2] + "/" + sys.argv[3] + "/" + sys.argv[4]
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_config)
if fs.exists(sc._jvm.org.apache.hadoop.fs.Path(endpoint + path)):
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(endpoint + path), True)

FIGURE 4.29 Azure Synapse Analytics Develop hub, Apache Spark job definition
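The snippet shown above only removes the target path. One way to archive the data rather than simply delete it is to copy the path to an archive location first and then remove the original. The following is a minimal, hypothetical sketch of that copy step; it reuses the fs, hadoop_config, endpoint, and path variables from the previous snippet, and the archive/ destination folder is an assumption for illustration, not taken from the original archiveData.py program.

# Hypothetical sketch: copy the path to an assumed archive/ folder before deleting it.
# fs, hadoop_config, endpoint, and path come from the previous snippet; the archive
# destination name is an assumption, not part of the original program.
Path = sc._jvm.org.apache.hadoop.fs.Path
FileUtil = sc._jvm.org.apache.hadoop.fs.FileUtil
src = Path(endpoint + path)
dst = Path(endpoint + "archive/" + path)
if fs.exists(src):
    # FileUtil.copy(srcFS, src, dstFS, dst, deleteSource, conf)
    FileUtil.copy(fs, src, fs, dst, False, hadoop_config)
    fs.delete(src, True)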

The command-line arguments are optional and can instead be passed to the job by the pipeline that executes it. That means the path containing files that are no longer needed can change with each execution of the pipeline that archives or deletes the unnecessary data. It would also be good practice to pass the ADLS endpoint as an argument; it was hard-coded here only for simplicity, and in general it is not good practice to hard-code values in source code. A sketch of that approach follows.
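The following is a minimal sketch of passing the endpoint as an argument, assuming it arrives as the first argument followed by the four path segments; the argument order is an assumption for illustration, not how the original program is written.

# Hypothetical sketch: the ADLS endpoint arrives as the first argument instead of
# being hard-coded; the remaining arguments build the path, as in the earlier example.
# sc is the SparkContext, as in the earlier snippet.
import sys

endpoint = sys.argv[1]            # e.g., abfss://<container>@<account>.dfs.core.windows.net/EMEA/brainjammer/
path = "/".join(sys.argv[2:6])    # e.g., in/2022/04/01
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.set('fs.defaultFS', endpoint)
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_config)
target = sc._jvm.org.apache.hadoop.fs.Path(endpoint + path)
if fs.exists(target):
    fs.delete(target, True)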

Browse Gallery

This feature was introduced in Chapter 3. The gallery content accessible from the Develop hub is the same as that found in the Data and Integrate hubs. Figure 3.55 illustrates how it looks, in case you want to see it and read its description again.

Import

SQL scripts, KQL scripts, notebooks, and Apache Spark job definitions all have the option to export the configuration. For example, when you hover over a notebook, click the ellipsis (…), and select Export, you will see that you can export the configuration as a notebook (.ipynb), HTML (.html), Python (.py), or LaTeX (.tex) file. Then, you can use the Import feature to share the configuration or import it into another Azure Synapse Analytics workspace.
