DEV Community

Paulet Wairagu
Paulet Wairagu

Posted on

QN : Ingest and transform data in a lakehouse

  • lakehouse has two storage areas ; Files and Tables
  • Files
    • Store structured, queryable data by sql
    • Supports schema definitions and ACID transactions
  • Tables
    • Stores Raw or semi-structured data(CSV, parquet, JSON)
    • No schema support
    • Flexible for data explorations
  • Schema allows for logical ordering of data on business functions or domain (sales,marketing etc)
  • A dbo schema is enabled by default once a lakehouse is created
  • Schema-enabled lakehouses also support schema-level permissions and cross-workspace queries using the four-part namespace
  • Lakehouse mode : Lakehouse Explorer and SQL analytics endpoint
    • Lakehouse Explorer: Allows managing, Update, create, upload of data.You can switch between tables in the lakehouse
    • SQL anlytics endpoit : Does not allow modifying of the underlying data. You can query using TSQL at read only mode.
  • Loading data into lakehouse:
  1. Upload data into files/ folders on the explorer
  2. Load into delta tables (no code)
  3. Transform using power query in dataflow gen2
  4. INgest into notebooks using apache spark (programmatically)
  5. Use Copy data to move data into differnt sources using data factory pipelines

-Shortcuts allow you to reference external data reducing copies. Access is managed by One Lake.

  • Schema shortcuts map an entire schema to a folder of Delta tables in another lakehouse.
  • SQL analytics endpoint provides read-only access to lakehouse tables using T-SQL queries.
  • SQL USE CASES : adhoc queries, BI connections to power bi or azure data studio, Data validation
  • You can use SQL views to store reusable query logic. Views are useful when you need to apply business rules, simplify complex joins, or provide curated data for downstream consumers.
  • You can use Spark SQL for SQL-like queries or PySpark for programmatic data manipulation in Notebooks.
  • Spark SQL works well for familiar SQL patterns. PySpark provides greater flexibility for complex transformations and integration with Python libraries.
  • Power BI is the business intelligence and reporting layer in Fabric. It serves as the consumption layer where business users access data through interactive reports and dashboards.
  • Power BI can connect to lakehouse data in two ways:
    • Query the SQL analytics endpoint
    • Create a semantic model

Top comments (0)