SUBAIR NURUDEEN ADEWALE

Step-by-Step Guide: Using Apache Spark in Microsoft Fabric

Project Overview

This hands-on guide walks you through setting up Apache Spark in Microsoft Fabric to process and analyze sales data. You'll learn how to:

  1. Create a Fabric Workspace
  2. Set up a Lakehouse for data storage
  3. Use Spark Notebooks for data transformation
  4. Perform data analysis and machine learning

Step 1: Set Up Microsoft Fabric Environment

1.1. Access Microsoft Fabric
Go to Microsoft Fabric Portal

Sign in with your Microsoft 365 account (ensure Fabric is enabled for your tenant).

1.2. Create a New Workspace
Click Workspaces (left sidebar).

Select New Workspace.

Enter a name (e.g., "SUBBYWORKSPACE").

Choose "Data Engineering" as the workload.

Click Apply.

📌 Need Help?
For a detailed guide on Steps 1 & 2 (creating a workspace and Lakehouse), check out:
Step-by-Step: Create & Load Data into a Lakehouse in Microsoft Fabric

Step 2: Create a Lakehouse for Data Storage
2.1. Create a Lakehouse
Inside your workspace, click New → Lakehouse.

Name it "SUBBYLAKEHOUSE" and click Create.

2.2. Upload Sample Data
Download Retail Sales Data (CSV) (or use any sales dataset).

In your Lakehouse, go to Files → Upload → Select the CSV file.

After the upload, click on the workspace and then SUBBYLAKEHOUSE to see your Tables and Files folders.

Step 3: Create a Spark Notebook

3.1. Click on Files and then the orders folder to display the CSV file that contains your dataset.

Click Open notebook, and then click New notebook.

3.2. We have successfully created our Spark notebook, and we can give it a name by clicking Notebook at the top.
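
Before moving on, you can confirm that the uploaded file is visible from the notebook. The snippet below is a minimal sketch that assumes the CSV was uploaded to the orders folder in Step 2.2 and that the mssparkutils helpers normally available in Fabric notebooks are present; adjust the path if you used a different folder.

# List the contents of the Files/orders folder in the attached Lakehouse
# (assumes the CSV from Step 2.2 was uploaded there)
for f in mssparkutils.fs.ls("Files/orders"):
    print(f.name, f.size)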

Step 4: Load and Explore Data

4.1. Read Data into a Spark DataFrame
To read and load data into a Spark DataFrame, add a code cell and enter the following PySpark code (modify it to suit your needs).

from pyspark.sql.types import *

# Define an explicit schema for the orders CSV file
orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])

# Load the 2019 orders CSV from the Lakehouse Files area using the schema
df = spark.read.format("csv").schema(orderSchema).load("Files/orders/2019.csv")

display(df)
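
The read above assumes the 2019.csv file has no header row, which is why an explicit schema is supplied. If your own dataset includes a header, a slightly different read works; this is a sketch using the standard Spark CSV reader options, with the same file path assumed as above.

# Alternative read for a CSV that includes a header row;
# Spark infers the column types instead of using an explicit schema
df_with_header = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/orders/2019.csv")

display(df_with_header)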

4.2. Aggregate and Group Data in a DataFrame

Add a code cell, and enter the following code:

from pyspark.sql.functions import *

yearlySales = df.select(year(col("OrderDate")).alias("Year")).groupBy("Year").count().orderBy("Year")

display(yearlySales)

Run the cell. Examine the output. The results now show the number of sales orders per year:

  1. The import statement enables you to use the Spark SQL library.
  2. The select method is used with a SQL year function to extract the year component of the OrderDate field.
  3. The alias method is used to assign a column name to the extracted year value.
  4. The groupBy method groups the data by the derived Year column.
  5. The count of rows in each group is calculated before the orderBy method is used to sort the resulting DataFrame.
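
You can push the same pattern one step further by aggregating a derived column. The sketch below totals gross revenue per year, assuming Quantity and UnitPrice carry the meanings implied by the schema defined earlier; adjust the column names if your dataset differs.

from pyspark.sql.functions import year, col, sum as spark_sum

# Derive a Revenue column (Quantity * UnitPrice) and total it per order year
yearlyRevenue = (
    df.withColumn("Revenue", col("Quantity") * col("UnitPrice"))
      .groupBy(year(col("OrderDate")).alias("Year"))
      .agg(spark_sum("Revenue").alias("TotalRevenue"))
      .orderBy("Year")
)

display(yearlyRevenue)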

Conclusion

This step-by-step guide gives you hands-on experience with Spark in Microsoft Fabric! 🚀 For detailed instructions on Steps 1 & 2, refer to:
📖 Step-by-Step: Create & Load Data into a Lakehouse in Microsoft Fabric
