This hands-on guide walks you through setting up Apache Spark in Microsoft Fabric to process and analyze sales data. You'll learn how to:
- Create a Fabric Workspace
- Set up a Lakehouse for data storage
- Use Spark Notebooks for data transformation
- Perform basic data exploration and aggregation
Step 1: Set Up Microsoft Fabric Environment
1.1. Access Microsoft Fabric
Go to Microsoft Fabric Portal
Sign in with your Microsoft 365 account (ensure Fabric is enabled for your tenant).
1.2. Create a New Workspace
Click Workspaces (left sidebar).
Select New Workspace.
Enter a name (e.g., "SUBBYWORKSPACE").
Choose "Data Engineering" as the workload.
Click Apply.
Need Help?
For a detailed guide on Steps 1 & 2 (creating a workspace and Lakehouse), check out:
Step-by-Step: Create & Load Data into a Lakehouse in Microsoft Fabric
Step 2: Create a Lakehouse for Data Storage
2.1. Create a Lakehouse
Inside your workspace, click New → Lakehouse.
Name it "SUBBYLAKEHOUSE" and click Create.
2.2. Upload Sample Data
Download Retail Sales Data (CSV) (or use any sales dataset).
In your Lakehouse, go to Files → Upload → select the CSV file.
After the upload, click the workspace and then SUBBYLAKEHOUSE to see your Tables and Files folders.
Step 3: Create a Spark Notebook
3.1. Click Files and then the orders folder to display the CSV file that contains your dataset.
Click Open notebook and then New notebook.
3.2. The Spark notebook is now created. Give it a name by clicking the notebook name at the top.
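Before moving on, you can confirm the uploaded files are visible from the notebook. The following is a minimal sketch, assuming the built-in mssparkutils file-system helper available in Fabric notebooks and the Files/orders folder from Step 2:
from notebookutils import mssparkutils
# List the files uploaded to the Lakehouse Files/orders folder (assumed path)
for f in mssparkutils.fs.ls("Files/orders"):
    print(f.name, f.size)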
Step 4: Load and Explore Data
4.1. Read Data into a Spark DataFrame
To read and load data into a Spark DataFrame, add a code cell and enter the following PySpark code. Adjust the file path and schema to match your own dataset.
from pyspark.sql.types import *

# Define the expected schema for the orders data
orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])

# Read the 2019 orders file from the Lakehouse Files area using the schema above
df = spark.read.format("csv").schema(orderSchema).load("Files/orders/2019.csv")
display(df)
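This walkthrough loads a single 2019.csv file. If you later upload more yearly files to the same orders folder, a wildcard path can read them all into one DataFrame; the sketch below assumes the extra files share the schema defined above:
# Read every CSV in the orders folder with the same schema (assumed layout)
all_orders = spark.read.format("csv").schema(orderSchema).load("Files/orders/*.csv")
print(all_orders.count())
display(all_orders)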
4.2. Aggregate and Group Data in a DataFrame
Add a code cell, and enter the following code:
from pyspark.sql.functions import *
yearlySales = df.select(year(col("OrderDate")).alias("Year")).groupBy("Year").count().orderBy("Year")
display(yearlySales)
Run the cell. Examine the output. The results now show the number of sales orders per year:
- The import statement enables you to use the Spark SQL library.
- The select method is used with a SQL year function to extract the year component of the OrderDate field.
- The alias method is used to assign a column name to the extracted year value.
- The groupBy method groups the data by the derived Year column.
- The count of rows in each group is calculated before the orderBy method is used to sort the resulting DataFrame.
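Building on the same idea, you can aggregate a derived value rather than just counting rows. The sketch below is an illustrative example (not part of the original walkthrough) that computes total revenue per year as Quantity multiplied by UnitPrice, assuming the df loaded in step 4.1:
from pyspark.sql import functions as F

# Derive a Revenue column and sum it per order year (illustrative example)
yearlyRevenue = (
    df.withColumn("Revenue", F.col("Quantity") * F.col("UnitPrice"))
      .groupBy(F.year(F.col("OrderDate")).alias("Year"))
      .agg(F.round(F.sum("Revenue"), 2).alias("TotalRevenue"))
      .orderBy("Year")
)
display(yearlyRevenue)
Run the cell to see one row per year with the rounded revenue total.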
Conclusion
This step-by-step guide gives you hands-on experience with Spark in Microsoft Fabric! For detailed instructions on Steps 1 & 2, refer to:
Step-by-Step: Create & Load Data into a Lakehouse in Microsoft Fabric