<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Indrasen</title>
    <description>The latest articles on DEV Community by Indrasen (@indrasen_9d014cf224a46c4a).</description>
    <link>https://dev.to/indrasen_9d014cf224a46c4a</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1630928%2F57e82542-74e9-4329-8c32-c6ac7e1d7f24.png</url>
      <title>DEV Community: Indrasen</title>
      <link>https://dev.to/indrasen_9d014cf224a46c4a</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/indrasen_9d014cf224a46c4a"/>
    <language>en</language>
    <item>
      <title>Unity Catalog in Azure Databricks — Everything You Need to Know</title>
      <dc:creator>Indrasen</dc:creator>
      <pubDate>Fri, 20 Jun 2025 06:15:09 +0000</pubDate>
      <link>https://dev.to/indrasen_9d014cf224a46c4a/unity-catalog-in-azure-databricks-everything-you-need-to-know-3k31</link>
      <guid>https://dev.to/indrasen_9d014cf224a46c4a/unity-catalog-in-azure-databricks-everything-you-need-to-know-3k31</guid>
      <description>&lt;h1&gt;
  
  
  Unity Catalog in Azure Databricks — Everything You Need to Know [2025 Edition]
&lt;/h1&gt;

&lt;p&gt;If you're working with data on Azure, you’ve probably heard of &lt;strong&gt;Unity Catalog&lt;/strong&gt;. It's a powerful feature within Azure Databricks that brings data governance, security, and organization to the forefront of your data workflows.&lt;/p&gt;

&lt;p&gt;This guide will walk you through everything — from setting up Unity Catalog to working with Delta Lake, volumes, and real-time data ingestion.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 What is Databricks?
&lt;/h2&gt;

&lt;p&gt;Databricks is a cloud-based platform built on Apache Spark that unifies data engineering, data science, machine learning, and analytics workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧱 Azure Databricks Architecture
&lt;/h2&gt;

&lt;p&gt;Azure Databricks has two planes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control Plane&lt;/strong&gt;: Hosts backend services (UI, job scheduler).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute Plane&lt;/strong&gt;: Where your jobs run (clusters, notebooks).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Workspaces have their own storage accounts that contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System data (job logs, notebook revisions)&lt;/li&gt;
&lt;li&gt;DBFS (Databricks File System)&lt;/li&gt;
&lt;li&gt;Unity Catalog workspace catalog&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📚 What is Unity Catalog?
&lt;/h2&gt;

&lt;p&gt;Unity Catalog is a &lt;strong&gt;centralized governance layer&lt;/strong&gt; for your data. It manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What data exists&lt;/li&gt;
&lt;li&gt;Who can access it&lt;/li&gt;
&lt;li&gt;Where it lives&lt;/li&gt;
&lt;li&gt;How it’s used&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It uses a 3-tier structure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Catalogs&lt;/strong&gt; (e.g., &lt;code&gt;sales&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schemas&lt;/strong&gt; (e.g., &lt;code&gt;raw&lt;/code&gt;, &lt;code&gt;cleaned&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Objects&lt;/strong&gt; (tables, views, volumes, functions, ML models)&lt;/li&gt;
&lt;/ol&gt;
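&lt;p&gt;Every object is addressed by a three-level name, &lt;code&gt;catalog.schema.object&lt;/code&gt;. A minimal sketch in Databricks SQL (the &lt;code&gt;sales&lt;/code&gt;/&lt;code&gt;raw&lt;/code&gt;/&lt;code&gt;orders&lt;/code&gt; names are illustrative and assume a Unity Catalog-enabled workspace):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- SQL in a Databricks notebook
CREATE CATALOG IF NOT EXISTS sales;
CREATE SCHEMA IF NOT EXISTS sales.raw;

-- Reference a table by its fully qualified three-level name
SELECT * FROM sales.raw.orders;

-- Or set defaults once and use short names afterwards
USE CATALOG sales;
USE SCHEMA raw;
SELECT * FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;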




&lt;h2&gt;
  
  
  🔐 Managed vs External Tables
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Managed Table&lt;/th&gt;
&lt;th&gt;External Table&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Controlled by Databricks&lt;/td&gt;
&lt;td&gt;Controlled by you&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On DROP command&lt;/td&gt;
&lt;td&gt;Deletes metadata + data&lt;/td&gt;
&lt;td&gt;Deletes only metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use case&lt;/td&gt;
&lt;td&gt;Internal pipelines&lt;/td&gt;
&lt;td&gt;External sources in ADLS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
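&lt;p&gt;The difference shows up at creation time. A hedged sketch (the table names and the ADLS path are illustrative, and the external location must already be registered by an admin):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Managed: Unity Catalog decides where the data files live
CREATE TABLE sales.raw.orders_managed (id INT, amount DOUBLE);

-- External: you point the table at storage you control
CREATE TABLE sales.raw.orders_external (id INT, amount DOUBLE)
LOCATION 'abfss://container@storageaccount.dfs.core.windows.net/orders';

-- DROP removes the data files only for the managed table;
-- the external table's files remain in your storage account
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;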




&lt;h2&gt;
  
  
  🔎 Key Unity Catalog Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Access Control&lt;/strong&gt;: Unified permission management across workspaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL-Based Security&lt;/strong&gt;: &lt;code&gt;GRANT SELECT ON TABLE...&lt;/code&gt;, just like in a traditional database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit Logs&lt;/strong&gt;: Built-in tracking of who accessed what, when.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lineage Tracking&lt;/strong&gt;: See data flow across notebooks and jobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discovery&lt;/strong&gt;: Tag and describe datasets easily.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Tables (Preview)&lt;/strong&gt;: Query usage and audit info with SQL.&lt;/li&gt;
&lt;/ul&gt;
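&lt;p&gt;Permissions really are plain SQL. For example (the principal and object names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Let a group read a table
GRANT SELECT ON TABLE sales.raw.orders TO `data_analysts`;

-- Let them use the catalog/schema and create tables there
GRANT USE CATALOG ON CATALOG sales TO `data_analysts`;
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA sales.raw TO `data_analysts`;

-- See what has been granted
SHOW GRANTS ON TABLE sales.raw.orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;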




&lt;h2&gt;
  
  
  📦 Volumes in Unity Catalog
&lt;/h2&gt;

&lt;p&gt;Volumes govern &lt;strong&gt;non-tabular files&lt;/strong&gt; (CSV, JSON, logs, images) with the same governance model as tables.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Located inside schemas.&lt;/li&gt;
&lt;li&gt;Two types: &lt;strong&gt;Managed&lt;/strong&gt; and &lt;strong&gt;External&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Queryable via SQL or notebooks.&lt;/li&gt;
&lt;/ul&gt;
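&lt;p&gt;Files in a volume are addressed through a governed path of the form &lt;code&gt;/Volumes/&amp;lt;catalog&amp;gt;/&amp;lt;schema&amp;gt;/&amp;lt;volume&amp;gt;/...&lt;/code&gt;. A sketch (names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE VOLUME sales.raw.landing;

-- List and read files through the governed path
LIST '/Volumes/sales/raw/landing/';
SELECT * FROM read_files('/Volumes/sales/raw/landing/orders.csv');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;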




&lt;h2&gt;
  
  
  🔁 Delta Lake + Unity Catalog
&lt;/h2&gt;

&lt;p&gt;Delta Lake is the open-source storage layer that backs tables in Databricks. It supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACID transactions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema evolution&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time travel&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Upserts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized performance&lt;/strong&gt; (via &lt;code&gt;OPTIMIZE&lt;/code&gt; and Deletion Vectors)&lt;/li&gt;
&lt;/ul&gt;
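&lt;p&gt;A few of these in action, as a sketch (table names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Time travel: query the table as of an earlier version or timestamp
SELECT * FROM sales.raw.orders VERSION AS OF 3;
SELECT * FROM sales.raw.orders TIMESTAMP AS OF '2025-06-01';

-- Upsert with MERGE
MERGE INTO sales.raw.orders AS t
USING updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Compact small files
OPTIMIZE sales.raw.orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;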

&lt;h3&gt;
  
  
  Tombstoning
&lt;/h3&gt;

&lt;p&gt;Old files aren’t deleted immediately. They’re marked as “tombstoned” to support versioning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deletion Vectors
&lt;/h3&gt;

&lt;p&gt;Instead of rewriting files, specific rows are marked as "deleted" — enabling row-level versioning.&lt;/p&gt;
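&lt;p&gt;The idea can be modelled in a few lines of plain Python: a toy "data file" whose rows are never rewritten, plus a side bitmap recording which rows are logically deleted. This is only an illustration of the concept, not how Delta Lake implements it.&lt;/p&gt;

```python
# Toy model of a deletion vector: the data file is immutable;
# a delete only flips a bit in a side bitmap.
rows = ["order-1", "order-2", "order-3", "order-4"]
deleted = [False] * len(rows)          # the "deletion vector"

def delete_row(index):
    deleted[index] = True              # no rewrite of the data file

def read_table():
    # Readers apply the bitmap to filter out deleted rows
    return [r for r, d in zip(rows, deleted) if not d]

delete_row(1)
print(read_table())                    # ['order-1', 'order-3', 'order-4']
```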




&lt;h2&gt;
  
  
  🧬 Deep vs Shallow Clone
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Shallow Clone&lt;/th&gt;
&lt;th&gt;Deep Clone&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data Copy&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use Case&lt;/td&gt;
&lt;td&gt;Temporary testing&lt;/td&gt;
&lt;td&gt;Backups, safe duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
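&lt;p&gt;Both are one-liners in Delta SQL. A sketch (names illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Shallow clone: new table metadata pointing at the source's data files
CREATE TABLE sales.raw.orders_test SHALLOW CLONE sales.raw.orders;

-- Deep clone: a full, independent copy of the data
CREATE TABLE sales.raw.orders_backup DEEP CLONE sales.raw.orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;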




&lt;h2&gt;
  
  
  🔁 Incremental Loading with Auto Loader
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Auto Loader&lt;/strong&gt; for &lt;strong&gt;incremental, near-real-time ingestion&lt;/strong&gt; of new files.&lt;/li&gt;
&lt;li&gt;Store the inferred schema in a dedicated &lt;code&gt;schemaLocation&lt;/code&gt; so it can evolve over time.&lt;/li&gt;
&lt;li&gt;Use a &lt;code&gt;checkpointLocation&lt;/code&gt; so each file is processed exactly once and duplicates are avoided.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;trigger(processingTime=...)&lt;/code&gt; to run the stream as repeatedly scheduled micro-batches.&lt;/li&gt;
&lt;/ul&gt;
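&lt;p&gt;Putting those points together, a minimal Auto Loader sketch in PySpark (the paths and table name are illustrative, and this assumes a running Databricks cluster):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/sales/raw/schemas/orders")
    .load("/Volumes/sales/raw/landing/")
  .writeStream
    .option("checkpointLocation", "/Volumes/sales/raw/checkpoints/orders")
    .trigger(processingTime="1 minute")
    .toTable("sales.raw.orders_bronze"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;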




&lt;h2&gt;
  
  
  ⚙️ Databricks Workflows
&lt;/h2&gt;

&lt;p&gt;Automate your data pipelines with &lt;strong&gt;Databricks Workflows&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chain multiple notebook tasks&lt;/li&gt;
&lt;li&gt;Use UI-based DAG editor&lt;/li&gt;
&lt;li&gt;Schedule and trigger based on events&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧠 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Unity Catalog is &lt;strong&gt;a must-have&lt;/strong&gt; for any serious data platform built on Azure Databricks. It offers robust governance, scalable architecture, and seamless integration with Delta Lake and real-time data streams.&lt;/p&gt;

&lt;p&gt;If you're starting your data governance journey, Unity Catalog should be at the top of your list.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Let me know your thoughts in the comments or connect with me on LinkedIn. Happy to dive deeper into any part!&lt;/strong&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;🏷️ #databricks #azure #unitycatalog #dataengineering #deltalake #bigdata #streaming #devops&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>databricks</category>
      <category>azure</category>
      <category>unitycatalog</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Azure Data Factory</title>
      <dc:creator>Indrasen</dc:creator>
      <pubDate>Sat, 08 Feb 2025 12:23:54 +0000</pubDate>
      <link>https://dev.to/indrasen_9d014cf224a46c4a/azure-data-factory-2oog</link>
      <guid>https://dev.to/indrasen_9d014cf224a46c4a/azure-data-factory-2oog</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is Azure Data Factory?&lt;/strong&gt;&lt;br&gt;
Azure Data Factory (ADF) is an ETL (Extract, Transform, Load) tool that integrates data of various sizes and formats from many sources. In short, it is a serverless, fully managed data integration service for ingesting, preparing, and transforming all of your data at scale. ADF pipelines are commonly used to transfer data from on-premises systems to the cloud on a defined schedule.&lt;br&gt;
ADF helps you automate and manage data workflows that move between on-premises and cloud-based sources and destinations, orchestrating them as data-driven pipelines. It stands out from other ETL tools by being easy to use, cost-effective, and a powerful, intelligent, code-free service.&lt;br&gt;
As the volume of data grows day by day, many businesses are adopting cloud technology to make their platforms scalable. That growth in cloud adoption creates a need for reliable cloud ETL tools to make data integration possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does Azure Data Factory Work?&lt;/strong&gt;&lt;br&gt;
With a graphical interface, ADF makes it easy to create complex ETL (Extract, Transform, Load) processes that bring together data from various sources and formats. Here are some of the key points about Azure Data Factory:&lt;br&gt;
• Data Ingestion: ADF can connect to a wide range of data sources, from on-premises databases to cloud-based storage services.&lt;br&gt;
• Data Transformation: Using mapping data flows and various transformation activities, ADF can clean, aggregate, and transform data to meet business requirements, drawing on services such as Azure Databricks or Azure HDInsight.&lt;br&gt;
• Scheduling and Monitoring: ADF provides strong scheduling capabilities to automate workflows, plus monitoring tools for tracking pipeline progress and health.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure Data Factory(ADF) Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxl41c1c0govfo5mwkno9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxl41c1c0govfo5mwkno9.png" alt="Image description" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple (high-level) architecture of Azure Data Factory:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7g17smhradzykt6m1vr4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7g17smhradzykt6m1vr4.png" alt="Image description" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A detailed overview of the complete Data Factory architecture:&lt;br&gt;
[&lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/media/introduction/data-factory-visual-guide.png" rel="noopener noreferrer"&gt;https://learn.microsoft.com/en-us/azure/data-factory/media/introduction/data-factory-visual-guide.png&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;Note: in that diagram, the grey backgrounds mark different scenarios, definitions, and concepts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect and Collect&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Businesses have data in various forms and places (on-premises, cloud, SaaS, databases). ADF makes integration easy by connecting multiple sources and aggregating data to be processed.&lt;/p&gt;

&lt;p&gt;Without it, businesses would have to build and maintain costly, complicated custom pipelines. ADF handles this programmatically with the Copy activity, copying data into Azure Data Lake or Blob Storage for processing with Azure Databricks or HDInsight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transform &amp;amp; Enrich with Azure Data Factory&lt;/strong&gt;&lt;br&gt;
Once data is in the cloud, ADF Mapping Data Flows helps process and transform it using Spark, without needing Spark expertise.&lt;/p&gt;

&lt;p&gt;For custom transformations, ADF supports external compute services like HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning. 🚀&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI/CD and publish&lt;/strong&gt;&lt;br&gt;
Data Factory supports end-to-end CI/CD of data pipelines with Azure DevOps and GitHub, allowing incremental development and deployment of ETL processes prior to final publishing. After they are perfected, load data to Azure Data Warehouse, Azure SQL Database, Azure Cosmos DB, or any analytics engine supported by business intelligence tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor&lt;/strong&gt;&lt;br&gt;
Track your data integration pipelines to make sure they are delivering business value. Azure Data Factory natively supports monitoring via Azure Monitor, APIs, PowerShell, Azure Monitor logs, and health panels in the Azure portal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overview of ADF Components:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf9ouxx0es7tfj4818tr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf9ouxx0es7tfj4818tr.png" alt="Image description" width="800" height="680"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top-level concepts&lt;/strong&gt;&lt;br&gt;
An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data Factory is composed of the following key components:&lt;br&gt;
• Pipelines&lt;br&gt;
• Activities&lt;br&gt;
• Datasets&lt;br&gt;
• Linked services&lt;br&gt;
• Data Flows&lt;br&gt;
• Integration Runtimes&lt;br&gt;
These components work together to provide the platform on which you can compose data-driven workflows with steps to move and transform data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline :&lt;/strong&gt;&lt;br&gt;
A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to partition the data.&lt;br&gt;
The benefit of this is that the pipeline allows you to manage the activities as a set instead of managing each one individually. The activities in a pipeline can be chained together to operate sequentially, or they can operate independently in parallel.&lt;/p&gt;
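&lt;p&gt;As a toy illustration of that idea in plain Python (this is not the ADF SDK, just a sketch of grouping activities and running them sequentially or in parallel):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "activities": each is just a function that returns a result.
def ingest():    return "ingested"
def transform(): return "transformed"
def load():      return "loaded"

def run_sequential(activities):
    """Chain activities so each runs after the previous one finishes."""
    return [activity() for activity in activities]

def run_parallel(activities):
    """Run independent activities at the same time."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(activity) for activity in activities]
        return [f.result() for f in futures]

print(run_sequential([ingest, transform, load]))  # ['ingested', 'transformed', 'loaded']
print(run_parallel([ingest, transform]))          # ['ingested', 'transformed']
```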

&lt;p&gt;&lt;strong&gt;Mapping data flows:&lt;/strong&gt;&lt;br&gt;
Create and manage graphs of data transformation logic that you can use to transform data of any size. You can build up a reusable library of data transformation routines and execute those processes in a scaled-out manner from your ADF pipelines. Data Factory executes your logic on a Spark cluster that spins up and spins down as needed, so you never have to manage or maintain clusters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Activity:&lt;/strong&gt;&lt;br&gt;
Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data from one data store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datasets:&lt;/strong&gt;&lt;br&gt;
Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linked services:&lt;/strong&gt;&lt;br&gt;
Linked services are much like connection strings, which define the connection information that's needed for Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to the data source, and a dataset represents the structure of the data. For example, an Azure Storage-linked service specifies a connection string to connect to the Azure Storage account. Additionally, an Azure blob dataset specifies the blob container and the folder that contains the data.&lt;br&gt;
Linked services are used for two purposes in Data Factory:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To represent a data store that includes, but isn't limited to, a SQL Server database, Oracle database, file share, or Azure blob storage account. For a list of supported data stores, see the copy activity article.&lt;/li&gt;
&lt;li&gt;To represent a compute resource that can host the execution of an activity. For example, the HDInsightHive activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and supported compute environments, see the transform data article.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Integration Runtime:&lt;/strong&gt;&lt;br&gt;
In Data Factory, an activity defines the action to be performed. A linked service defines a target data store or a compute service. An integration runtime provides the bridge between the activity and linked Services. It's referenced by the linked service or activity, and provides the compute environment where the activity either runs on or gets dispatched from. This way, the activity can be performed in the region closest possible to the target data store or compute service in the most performant way while meeting security and compliance needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Triggers:&lt;/strong&gt;&lt;br&gt;
Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off. There are different types of triggers for different types of events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline runs:&lt;/strong&gt;&lt;br&gt;
A pipeline run is an instance of the pipeline execution. Pipeline runs are typically instantiated by passing the arguments to the parameters that are defined in pipelines. The arguments can be passed manually or within the trigger definition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;br&gt;
Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The arguments for the defined parameters are passed during execution from the run context that was created by a trigger or a pipeline that was executed manually. Activities within the pipeline consume the parameter values.&lt;br&gt;
A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets and can consume the properties that are defined in the dataset definition.&lt;br&gt;
A linked service is also a strongly typed parameter that contains the connection information to either a data store or a compute environment. It is also a reusable/referenceable entity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Control flow&lt;/strong&gt;&lt;br&gt;
Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Variables:&lt;/strong&gt;&lt;br&gt;
Variables can be used inside of pipelines to store temporary values and can also be used in conjunction with parameters to enable passing values between pipelines, data flows, and other activities.&lt;/p&gt;

&lt;p&gt;For more detail, see: [&lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/introduction#connect-and-collect" rel="noopener noreferrer"&gt;https://learn.microsoft.com/en-us/azure/data-factory/introduction#connect-and-collect&lt;/a&gt;]&lt;/p&gt;

</description>
      <category>data</category>
      <category>azure</category>
      <category>datafactory</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Python OOPS basic</title>
      <dc:creator>Indrasen</dc:creator>
      <pubDate>Wed, 02 Oct 2024 20:10:24 +0000</pubDate>
      <link>https://dev.to/indrasen_9d014cf224a46c4a/python-oops-basic-24aa</link>
      <guid>https://dev.to/indrasen_9d014cf224a46c4a/python-oops-basic-24aa</guid>
      <description>&lt;p&gt;In real world everything is Object and every object have 2 things behaviour and attribute. Attribute contains data stored in variable and behaviour is defined as method which is nothing but functions(what's needs to be done)&lt;/p&gt;

&lt;p&gt;Imagine there is some kind of phone company manufacturing phone but designed once but created multiple handset. Here design or blueprint is a class and object is a real stuff or entity or instance of class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Computer:
    def config(self):
        print("i5, 15gb, 1Tb")

comp1=Computer()
Computer.config()


Output:
D:\Testing Document Sample&amp;gt;C:/Python312/python.exe "d:/Testing Document Sample/calc.py"
Traceback (most recent call last):
  File "d:\Testing Document Sample\calc.py", line 7, in &amp;lt;module&amp;gt;
    Computer.config()
TypeError: Computer.config() missing 1 required positional argument: 'self'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One class can have multiple objects, and the &lt;code&gt;config&lt;/code&gt; method behaves according to the object it is called on, because different objects hold different data. If I just say "hey, walk", which object am I addressing? I have to say "Hey Ravi, walk" or "Hey Mukesh, walk".&lt;br&gt;
In the same way, when calling &lt;code&gt;Computer.config&lt;/code&gt; we must say which object we are talking about. Here it is &lt;code&gt;comp1&lt;/code&gt;: "Hey! I want the config for comp1".&lt;br&gt;
Updating the code accordingly and running it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Computer:
    def config(self):
        print("i5, 15gb, 1Tb")

comp1=Computer()
Computer.config(comp1)


Output:
D:\Testing Document Sample&amp;gt;C:/Python312/python.exe "d:/Testing Document Sample/calc.py"
i5, 15gb, 1Tb

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we call &lt;code&gt;Computer.config(comp1)&lt;/code&gt;, where &lt;code&gt;comp1&lt;/code&gt; is passed as the argument to the parameter &lt;code&gt;self&lt;/code&gt; of the &lt;code&gt;config&lt;/code&gt; method. That is where &lt;code&gt;self&lt;/code&gt; comes from: it is the parameter that receives the object.&lt;/p&gt;

&lt;p&gt;What if I want to call &lt;code&gt;config&lt;/code&gt; for &lt;code&gt;comp2&lt;/code&gt; as well? Running the code below prints the same output twice, because we are not yet storing different data per object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Computer:
    def config(self):
        print("i5, 15gb, 1Tb")

comp1=Computer()
comp2=Computer()
Computer.config(comp1)
Computer.config(comp2)


Output:
D:\Testing Document Sample&amp;gt;C:/Python312/python.exe "d:/Testing Document Sample/calc.py"
i5, 15gb, 1Tb
i5, 15gb, 1Tb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is another way to call the method: in the output below, the first two lines come from &lt;code&gt;Computer.config(comp1)&lt;/code&gt; and &lt;code&gt;Computer.config(comp2)&lt;/code&gt;, and the last two from &lt;code&gt;comp1.config()&lt;/code&gt; and &lt;code&gt;comp2.config()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Computer:
    def config(self):
        print("i5, 15gb, 1Tb")

comp1=Computer()
comp2=Computer()
Computer.config(comp1)
Computer.config(comp2)
comp1.config()
comp2.config()

Output:
D:\Testing Document Sample&amp;gt;C:/Python312/python.exe "d:/Testing Document Sample/calc.py"
i5, 15gb, 1Tb
i5, 15gb, 1Tb
i5, 15gb, 1Tb
i5, 15gb, 1Tb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens behind the scenes? Because an object is bound to its class at creation time, Python passes &lt;code&gt;comp1&lt;/code&gt; or &lt;code&gt;comp2&lt;/code&gt; automatically as the argument for &lt;code&gt;self&lt;/code&gt;. That is why most code uses this shorter syntax rather than the class-qualified form:&lt;/p&gt;

&lt;p&gt;comp1.config()&lt;br&gt;
comp2.config()&lt;/p&gt;

&lt;p&gt;The same applies to built-in types. If you Ctrl+click &lt;code&gt;bit_length&lt;/code&gt; in the example below, you will see it is defined with a &lt;code&gt;self&lt;/code&gt; parameter even though we never pass it explicitly: the integer &lt;code&gt;a&lt;/code&gt; is the object, and it is passed automatically as &lt;code&gt;self&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;e.g.:&lt;br&gt;
a = 3&lt;br&gt;
a.bit_length()&lt;/p&gt;
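&lt;p&gt;To make that concrete, here is a runnable version of that example (the variable names are illustrative):&lt;/p&gt;

```python
a = 3                      # 3 is 0b11, an int object
n = a.bit_length()         # equivalent to int.bit_length(a): a is passed as self
print(n)                   # 2 — two bits are needed to represent 3
```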

&lt;p&gt;So this is how objects are created and how methods are called on them.&lt;/p&gt;

&lt;p&gt;Let's move further. Suppose I want each computer to have a different CPU and RAM configuration, which requires two variables: &lt;code&gt;cpu&lt;/code&gt; and &lt;code&gt;ram&lt;/code&gt;. How do we attach them to each object? This is where Python's special "dunder" methods come in: methods and variables whose names are wrapped in double underscores, such as &lt;code&gt;__init__&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the example below, the message printed inside &lt;code&gt;__init__&lt;/code&gt; appears twice because we create two objects, and each creation automatically calls &lt;code&gt;__init__&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Computer:
    def __init__(self):          # acts as a constructor; called automatically when an object is created
        print("initalize init")
    def config(self):
        print("i5, 15gb, 1Tb")

comp1=Computer()
comp2=Computer()

comp1.config()
comp2.config()


Output:
D:\Testing Document Sample&amp;gt;C:/Python312/python.exe "d:/Testing Document Sample/calc.py"
initalize init
initalize init
i5, 15gb, 1Tb
i5, 15gb, 1Tb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now suppose I want each object (&lt;code&gt;comp1&lt;/code&gt;/&lt;code&gt;comp2&lt;/code&gt;) to carry its own &lt;code&gt;cpu&lt;/code&gt; and &lt;code&gt;ram&lt;/code&gt; values. We pass them as arguments and create instance variables in the &lt;code&gt;__init__&lt;/code&gt; method, so the values become part of each object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Computer:
    def __init__(self,cpu,ram):    # self refers to the object, so there are 3 parameters (self=comp1, cpu='i5', ram=16)
        self.cpu = cpu             # assign the value onto the object itself, since self is the object
        self.ram = ram
    def config(self):
        print("config is", self.cpu, self.ram)   # cpu and ram belong to the object, so use self.cpu / self.ram

comp1=Computer('i5',16)
comp2=Computer('Ryzen 3',8)    # the two arguments are passed through to __init__

comp1.config()
comp2.config()


Output:
D:\Testing Document Sample&amp;gt;C:/Python312/python.exe "d:/Testing Document Sample/calc.py"
config is i5 16
config is Ryzen 3 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thus data and methods work together: the data is bound to the object, so each object carries its own instance variables while sharing the methods defined on the class.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;comp1&lt;/code&gt; is an object, or more precisely a reference to an object. Our system has a special region of memory called the heap, which holds all objects, so every object has an address in heap memory. One way to print that address is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class computer:
    pass
c1=computer()

print(id(c1))

Output:
D:\Testing Document Sample&amp;gt;C:/Python312/python.exe "d:/Testing Document Sample/demo.py"
2509775964640     # this is an id (in CPython it is the object's memory address)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What will be the size of an object?&lt;br&gt;
It depends on the number and size of its variables.&lt;/p&gt;

&lt;p&gt;Who allocates the memory for an object?&lt;br&gt;
The constructor.&lt;/p&gt;
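&lt;p&gt;A hedged illustration using the standard library (note that &lt;code&gt;sys.getsizeof&lt;/code&gt; reports only the shallow size of the instance itself, not the objects its attributes refer to; the exact numbers vary by Python version and platform):&lt;/p&gt;

```python
import sys

class Computer:
    def __init__(self):
        self.cpu = "i5"
        self.ram = 16

c1 = Computer()
# Shallow size of the instance; the attribute values live in
# separately allocated objects reached via the instance dictionary.
print(sys.getsizeof(c1))
print(sys.getsizeof(c1.__dict__))
```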

&lt;p&gt;We can also give each object its own values, and change one object's value without affecting the others:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example:

class computer:
    def __init__(self):
        self.name = "navin"
        self.age = 28

c1=computer()
c2=computer()

c1.name="rashi"
c2.age= 12

print (c1.name)
print (c2.name)
print (c1.age)
print (c2.age)

Output:
D:\Testing Document Sample&amp;gt;C:/Python312/python.exe "d:/Testing Document Sample/demo.py"
rashi
navin
28   
12

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the code above we can not only assign different values to each object explicitly, but also change them after creation.&lt;/p&gt;



&lt;p&gt;Why do we need this &lt;code&gt;self&lt;/code&gt;?&lt;br&gt;
To see why, let's add one more method, &lt;code&gt;update&lt;/code&gt;, which updates the age.&lt;br&gt;
When &lt;code&gt;update&lt;/code&gt; is called it changes &lt;code&gt;age&lt;/code&gt;, but here is the catch: we have two objects, so whose age should change? We never pass &lt;code&gt;c1&lt;/code&gt; or &lt;code&gt;c2&lt;/code&gt; as an explicit argument.&lt;br&gt;
This is exactly what &lt;code&gt;self&lt;/code&gt; resolves: when we call &lt;code&gt;c1.update()&lt;/code&gt;, Python passes &lt;code&gt;c1&lt;/code&gt; in the brackets, &lt;code&gt;self&lt;/code&gt; is bound to &lt;code&gt;c1&lt;/code&gt;, and only that object's age changes. That is the importance of &lt;code&gt;self&lt;/code&gt;: it refers to the object the method was called on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class computer:
    def __init__(self):
        self.name = "navin"
        self.age = 28

    def update(self):
        self.age = 30


c1=computer()
c2=computer()

c1.name="rashi"
c2.age= 12

c1.update()      # calls the update method; self is bound to c1, so only c1.age changes

print (c1.name)
print (c2.name)
print (c1.age)
print (c2.age)


Output:
D:\Testing Document Sample&amp;gt;C:/Python312/python.exe "d:/Testing Document Sample/demo.py"
rashi
navin
30
12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's compare the objects from the example above. By default, Python does not know how to compare two of our objects by value, so here I will write a separate method to do it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class computer:
    def __init__(self):
        self.name = "navin"
        self.age = 28

#   def update(self):
#      self.age = 30

    def compare(self,other):     # c1 becomes self (the object it is called on), c2 becomes other (passed as an argument)
        if self.age==other.age:
            return True
        else:
            return False

c1=computer()
c1.age=30
c2=computer()

# if c1 == c2:         # this would compare object addresses; we want to compare their values (age), the name doesn't matter

if c1.compare(c2):     # compare() is not a built-in method, so we created it ourselves
    print("They are same")

else:
    print("They are Different")

#c1.update()

print (c1.age)
print (c2.age)


Output:
D:\Testing Document Sample&amp;gt;C:/Python312/python.exe "d:/Testing Document Sample/demo.py"
They are Different
30
28
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
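&lt;p&gt;The compare() approach above works, but Python's idiomatic way is to define the special method __eq__, which hooks into the == operator itself. Below is a minimal sketch (the capitalized class name and the default age argument are illustrative, not from the original example):&lt;br&gt;&lt;/p&gt;

```python
class Computer:
    def __init__(self, age=28):
        self.age = age

    def __eq__(self, other):
        # called automatically when two Computer objects are compared with ==
        return self.age == other.age


c1 = Computer()
c2 = Computer()
c1.age = 30

print(c1 == c2)                       # False: ages differ (30 vs 28)
print(Computer(30) == Computer(30))   # True: equal ages
```

&lt;p&gt;With __eq__ defined, c1 == c2 compares ages instead of memory addresses.&lt;/p&gt;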






&lt;p&gt;Types of variables in OOP&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Instance Variables (Object Attributes)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Definition: These variables are tied to a specific instance of a class. Each object of the class can have different values for these variables.&lt;br&gt;
Scope: They are accessible only within the instance they belong to, not across all instances of the class.&lt;br&gt;
How to Define: Defined inside the constructor method (__init__()) and prefixed with self.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example:

class Car:
    def __init__(self, model, color):
        self.model = model  # Instance variable
        self.color = color  # Instance variable

car1 = Car("Tesla", "Red")
car2 = Car("BMW", "Blue")

print(car1.model)  # Output: Tesla
print(car2.model)  # Output: BMW

Here, model and color are instance variables that are specific to each instance of the Car class.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Class Variables (Static Variables)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Definition: These are variables that are shared across all instances of the class. They are defined inside the class but outside of any instance methods.&lt;br&gt;
Scope: They belong to the class itself and not to individual objects. Changes made to class variables affect all instances of the class.&lt;br&gt;
How to Define: Declared within the class, but outside of any method. They are accessed using the class name or through any instance of the class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example:

class Car:
    wheels = 4  # Class variable

    def __init__(self, model, color):
        self.model = model
        self.color = color

car1 = Car("Tesla", "Red")
car2 = Car("BMW", "Blue")

print(Car.wheels)  # Output: 4 (Accessed through class)
print(car1.wheels)  # Output: 4 (Accessed through an instance)

Here, wheels is a class variable shared by all instances of the Car class. Both car1 and car2 will have access to this variable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Method Variables (Local Variables)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Definition: Variables declared inside instance methods (other than __init__) are method variables, and they are local to that specific method.&lt;br&gt;
Scope: These variables are created and exist only within the method where they are defined. They are not accessible outside the method.&lt;br&gt;
How to Define: Defined inside any method of a class using standard variable declaration without self.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example:

class Car:
    def display_info(self):
        speed = 120  # Method variable
        print(f"Speed is {speed}")

car1 = Car()
car1.display_info()  # Output: Speed is 120

Here, speed is a method variable and is local to the display_info() method. It cannot be accessed outside this method.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Static Method Variables&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Definition: Variables defined inside static methods are local to that method, similar to instance methods. Static methods don’t have access to instance (self) or class variables unless passed explicitly.&lt;br&gt;
Scope: Local to the static method.&lt;br&gt;
How to Define: Defined inside a static method, typically declared with the @staticmethod decorator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example:

class Car:
    wheels = 4  # Class variable

    @staticmethod
    def show_description():
        description = "This is a car"  # Static method variable
        print(description)

Car.show_description()  # Output: This is a car

Here, description is a variable local to the static method show_description() and cannot be accessed outside this method.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;Class Method Variables&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Definition: Variables defined inside class methods are similar to static method variables, but class methods can access and modify class variables. Class methods are defined using the @classmethod decorator.&lt;br&gt;
Scope: Variables are local to the class method, but the method has access to class variables via the cls parameter.&lt;br&gt;
How to Define: Declared within class methods, typically with the @classmethod decorator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example:

class Car:
    wheels = 4  # Class variable

    @classmethod
    def set_wheels(cls, count):
        cls.wheels = count  # count is local to this class method; this line updates the class variable

Car.set_wheels(6)
print(Car.wheels)  # Output: 6

Here, count is a variable within the class method set_wheels, and the method can access and modify the class variable wheels.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Differences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Instance Variables&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Definition: Variables that are unique to each object (instance) of a class.&lt;br&gt;
Scope: Available only within the instance they belong to.&lt;br&gt;
Access: Accessed via self (i.e., self.variable_name).&lt;br&gt;
Storage: Stored within each object; each instance has its own copy of instance variables.&lt;br&gt;
Lifecycle: Created when an object is instantiated and destroyed when the object is deleted.&lt;br&gt;
Modification: Each object can have its own unique values for instance variables.&lt;br&gt;
Key Point: Instance variables are unique to each instance. For example, different Car objects can have different model and color.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Class Variables&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Definition: Variables that are shared across all instances of a class.&lt;br&gt;
Scope: Belongs to the class itself and is shared by all instances.&lt;br&gt;
Access: Accessed using ClassName.variable_name or via instances (though the preferred way is through the class name).&lt;br&gt;
Storage: Stored in the class, not in the individual object instances.&lt;br&gt;
Lifecycle: Created when the class is defined and exists for the lifetime of the class.&lt;br&gt;
Modification: Modifying a class variable through the class affects all instances, whereas modifying it through an instance creates a new instance variable for that particular object.&lt;br&gt;
Key Point: Class variables are shared across all instances. If wheels is changed through the class, it affects all instances unless overridden at the instance level.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Method Variables (Local Variables)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Definition: Variables defined inside a method, and they are local to that method.&lt;br&gt;
Scope: Available only within the method where they are defined.&lt;br&gt;
Access: Can only be accessed within the method; not accessible outside the method or across instances.&lt;br&gt;
Storage: Stored temporarily in the method call stack and are discarded once the method execution finishes.&lt;br&gt;
Lifecycle: Created when the method is called and destroyed once the method finishes execution.&lt;br&gt;
Modification: Each method call gets a fresh set of method variables.&lt;br&gt;
Key Point: Method variables are local to the method and do not persist beyond the method's execution.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Static Variables (Variables in Static Methods)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Definition: Variables defined inside a static method, local to the method itself, similar to method variables.&lt;br&gt;
Scope: Confined to the static method where they are defined.&lt;br&gt;
Access: Static methods do not have access to class or instance variables unless they are passed as arguments.&lt;br&gt;
Storage: Stored temporarily during the execution of the static method.&lt;br&gt;
Lifecycle: Created when the static method is called and destroyed once it finishes execution.&lt;br&gt;
Modification: Modifying static method variables only affects that specific method execution.&lt;br&gt;
Key Point: Static method variables are local to the static method and do not interact with class or instance variables unless explicitly passed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fou3ooavxx402d8xjt6ce.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fou3ooavxx402d8xjt6ce.JPG" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;Types of Methods in OOP:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Instance Methods:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Definition: Instance methods are the most common type of method in a class. They operate on instances of the class (i.e., objects) and can access or modify object attributes.&lt;br&gt;
How to Define: Instance methods must have self as their first parameter, which refers to the instance of the class. They can access both instance variables and class variables.&lt;br&gt;
Use Case: These methods are used to manipulate object-specific data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example:

class Car:
    def __init__(self, model, color):
        self.model = model  # Instance variable
        self.color = color  # Instance variable

    def display_info(self):  # Instance method
        print(f"Car model: {self.model}, Color: {self.color}")

car1 = Car("Tesla", "Red")
car1.display_info()  # Output: Car model: Tesla, Color: Red

Here, display_info is an instance method that operates on the instance car1 and accesses its instance variables (model and color).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Class Methods&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Definition: Class methods are methods that operate on the class itself rather than on instances of the class. They can access or modify class variables but cannot modify instance variables.&lt;br&gt;
How to Define: Class methods are defined using the @classmethod decorator, and they take cls as their first parameter, which refers to the class itself.&lt;br&gt;
Use Case: These methods are used to work with class-level data and can be called without creating an instance of the class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example
class Car:
    wheels = 4  # Class variable

    @classmethod
    def change_wheels(cls, count):  # Class method
        cls.wheels = count

Car.change_wheels(6)
print(Car.wheels)  # Output: 6

In this example, change_wheels is a class method that modifies the class variable wheels and can be called on the class itself, not on individual objects.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Static Methods:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Definition: Static methods are methods that don’t operate on instances or the class itself. They do not take self or cls as their first parameter and cannot modify object state or class state. Static methods behave like regular functions but are bound to a class for organizational purposes.&lt;br&gt;
How to Define: Static methods are defined using the @staticmethod decorator.&lt;br&gt;
Use Case: Static methods are used when you need a function that logically belongs to the class but doesn’t need to access or modify any class or instance variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example:

class Car:
    @staticmethod
    def is_motorized():  # Static method
        return True

print(Car.is_motorized())  # Output: True

Here, is_motorized is a static method that doesn’t rely on any class or instance data and can be called directly on the class.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Differences:&lt;/p&gt;

&lt;p&gt;Instance Methods:&lt;br&gt;
Access both instance and class variables.&lt;br&gt;
The self parameter allows access to object-specific data.&lt;/p&gt;

&lt;p&gt;Class Methods:&lt;br&gt;
Access only class variables.&lt;br&gt;
The cls parameter allows access to class-level data and can modify it.&lt;/p&gt;

&lt;p&gt;Static Methods:&lt;br&gt;
Don’t access or modify class or instance variables.&lt;br&gt;
Behave like regular functions, but are logically grouped inside a class.&lt;/p&gt;
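&lt;p&gt;To see all three method types side by side, here is a small combined sketch (the describe() and honk() method names are illustrative, not from the examples above):&lt;br&gt;&lt;/p&gt;

```python
class Car:
    wheels = 4  # class variable

    def __init__(self, model):
        self.model = model  # instance variable

    def describe(self):            # instance method: needs an object (self)
        return f"{self.model} with {self.wheels} wheels"

    @classmethod
    def set_wheels(cls, count):    # class method: works on the class (cls)
        cls.wheels = count

    @staticmethod
    def honk():                    # static method: no self, no cls
        return "Beep!"


car = Car("Tesla")
print(car.describe())   # Tesla with 4 wheels
Car.set_wheels(6)
print(car.describe())   # Tesla with 6 wheels
print(Car.honk())       # Beep!
```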

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd2ry8oj4dsbx0x9s73v.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd2ry8oj4dsbx0x9s73v.JPG" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;




</description>
      <category>python</category>
      <category>oop</category>
      <category>dsa</category>
    </item>
    <item>
      <title>Basic of Docker</title>
      <dc:creator>Indrasen</dc:creator>
      <pubDate>Sun, 29 Sep 2024 19:30:20 +0000</pubDate>
      <link>https://dev.to/indrasen_9d014cf224a46c4a/basic-of-docker-315c</link>
      <guid>https://dev.to/indrasen_9d014cf224a46c4a/basic-of-docker-315c</guid>
      <description>&lt;p&gt;Docker is an open-source platform that automates the deployment, scaling, and management of applications within lightweight, portable containers. Containers package an application and its dependencies into a single unit, ensuring that it runs consistently across different environments, from development to production.&lt;/p&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;p&gt;Containerization: Docker allows developers to package applications with all their dependencies, which helps avoid compatibility issues between different environments.&lt;/p&gt;

&lt;p&gt;Isolation: Each container runs in its own isolated environment, enabling developers to run multiple applications on the same host without conflicts.&lt;/p&gt;

&lt;p&gt;Portability: Docker containers can run on any system that has Docker installed, making it easy to move applications between environments such as local development, testing, and production.&lt;/p&gt;

&lt;p&gt;Efficiency: Containers share the host system's kernel, which makes them lightweight and fast compared to traditional virtual machines.&lt;/p&gt;

&lt;p&gt;Ecosystem: Docker has a rich ecosystem with tools like Docker Hub (a repository for sharing images) and Docker Compose (for managing multi-container applications), enhancing its functionality for developers.&lt;/p&gt;

&lt;p&gt;Learning Docker can greatly improve productivity and streamline the development process, making it an essential skill for modern software development.&lt;/p&gt;

&lt;p&gt;Note: you can create a VM in Azure, AWS, or another cloud platform to avoid installing Docker locally.&lt;/p&gt;

&lt;p&gt;Here is a reference block of common Docker commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Docker commands 
--------------------------


docker run -it --name new_container ubuntu /bin/bash        

 When you run this command:
Docker checks if the Ubuntu image is available locally. If not, it downloads the image from Docker Hub. It then creates a new container from the Ubuntu image. The -it flags ensure that you can interact with the shell. A Bash shell is started within the container, allowing you to execute commands as if you were using a regular Ubuntu terminal.





docker images

lists all Docker images available on your local machine





docker ps

checks which containers are currently running on your system; add -a to list all containers, including stopped ones





docker pull [image]
used to download an official Docker image from Docker Hub to your local machine. The image can be ubuntu, jenkins, or any other setup






docker start [container]
is used to start a stopped container. Ensure you have the correct name or ID, and check the container’s status after starting it.




docker stop [container]
is used to stop a started container. Ensure you have the correct name or ID, and check the container’s status after stopping it.




docker attach [container]
 command is used to connect your terminal to a running container's standard input, output, and error streams. This allows you to interact with the container as if you were directly accessing its terminal. Use it carefully, as it can affect the running processes within the container.





 docker exec -it [container_name_or_id] /bin/bash
 is used to start a new interactive shell session inside a running Docker container. docker exec is generally more versatile for administrative tasks, whereas docker attach is more suited for direct interaction with the primary process.




docker rm [container]  ;   docker rm container1 container2 container3 ;  docker container prune
the first command will only work on stopped containers; you can use the -f flag to forcefully remove a running container. The second command removes multiple containers. The third removes all stopped containers






docker commit [OPTIONS] CONTAINER NEW_IMAGE_NAME    
docker commit sonucontainer updateimage ;
to commit a Docker container and create a new image from it, you can use the docker commit command. This command saves the changes made to a container's filesystem as a new image.






docker diff [container name]  
command is used to inspect changes made to a container's filesystem since it was started. It shows what has been added, modified, or deleted in the container compared to its original state.



------------------------------------------------------------------------------------------------------------------------------------------------------------------

File System:  Docker file components and difference command 


A Dockerfile is a text file that contains a set of instructions used to build a Docker image. These instructions define how to configure and package the software, libraries, and dependencies required for the application you want to containerize.




In Docker, parser directives are special instructions placed at the beginning of a Dockerfile that influence how the file is processed by the Docker engine. These directives allow you to control certain behaviors, such as the default shell to be used for RUN commands or encoding for the Dockerfile itself.







# syntax

Purpose: Specifies the syntax or the version of the build backend to use, which allows features from specific versions of Docker BuildKit.
Usage: This directive helps specify which build system or additional features should be supported during the build process.

# syntax=docker/dockerfile:1.2


# syntax=docker/dockerfile:1.3
FROM ubuntu:20.04
RUN echo "Using BuildKit features from Dockerfile syntax version 1.3"

This Dockerfile specifies the use of Docker BuildKit features from version 1.3.






# escape

Purpose: Specifies the escape character to be used for line continuations in the Dockerfile. By default, Docker uses \ for line continuation, but you can change it to another character like backtick (`).
Usage: This directive is useful for Windows-based containers where the backslash \ is commonly used in file paths.
# escape=`

# escape=`
FROM mcr.microsoft.com/windows/servercore:ltsc2022
RUN dir C:\Windows `
    &amp;amp;&amp;amp; echo "Line continuation using backtick (`)"

In this example, the escape directive changes the line continuation character to a backtick, which is more suitable for Windows containers using Windows-style paths.



Without the # escape directive, Docker defaults to using \ as the line continuation character.
Without the # syntax directive, Docker uses the default syntax compatible with the current Docker version.




ARG             Use build-time variables.
FROM            Create a new build stage from a base image.
RUN             Execute build commands.
MAINTAINER      Specify the author of an image (deprecated; prefer LABEL).
COPY            Copy files and directories.
ADD             Add local or remote files and directories.
EXPOSE          Describe which ports your application is listening on.
WORKDIR         Change working directory.
CMD             Specify default commands.
ENTRYPOINT      Specify default executable.
ENV             Set environment variables.





ARG variables are only accessible during the build phase of a Docker image and are not available during the runtime of a container. Once the build is complete, the ARG values are discarded and cannot be accessed inside a running container. If you need access to the values during the runtime of the container, you will need to pass them as environment variables using ENV or provide runtime options via docker run.





Create a Dockerfile

vi Dockerfile


FROM ubuntu
RUN echo "My name is Indra Singh" &amp;gt; /tmp/testfile

save it in vi: press Esc, then type :wq and press Enter

run command    :    docker build -t test .
again create a container from the above image and check the /tmp directory; you will find the file





Another example

Create a Dockerfile

vi Dockerfile


FROM ubuntu
WORKDIR /tmp
RUN echo "My name is Indra Singh" &amp;gt; /tmp/testfile
ENV myname IndraSingh
COPY testfile1 /tmp
ADD test.tar.gz /tmp


save it in vi: press Esc, then type :wq and press Enter

touch testfile1                 # create a testfile1
touch test                      # create a test
tar -cvf test.tar test          #convert into test.tar
gzip test.tar                   #zip to gz for test.tar to create test.tar.gz
rm -rf test                     #remove test file as test.tar.gz file created
exit                            # exit from container


docker build -t newimage1 .                       # create image from the Dockerfile
docker run -it --name newcontatiner newimage1     # create and run the container from newimage1
check all the files 




-----------------------------------------------------------------------------------------------------------------------------------------------------------------
#Docker Volume


#create a file and dockerfile
touch file1 file2 Dockerfile


#edit docker file:
vi Dockerfile


#enter the commmand  in docker file:
FROM ubuntu
VOLUME ["/myvolume"]

press esc
:wq enter



#create image from file 
docker build -t myimage .



#check docker images:
docker images



#create container from myimage:
docker run -it --name contatiner myimage /bin/bash




#create files inside /myvolume, then exit from the container
cd /myvolume
touch filex filey filez
exit




#creates a new container named container2 from the Ubuntu image, running in privileged mode (granting it more permissions), and sharing volumes with the existing container (contatiner).
#After starting, the command drops you into an interactive bash shell within the container's Ubuntu environment.

docker run -it --name container2 --privileged=true --volumes-from contatiner ubuntu /bin/bash





#create volume using command 
docker run -it --name contatiner3 -v /volume2 ubuntu /bin/bash



#same as above shares volumes with an existing container(container3) and creating new container (container4) 
docker run -it --name container4 --privileged=true --volumes-from contatiner3 ubuntu /bin/bash



#Map host volume to container
docker run -it --name hostcont -v /home/azureuser:/sonu --privileged=true --volumes-from contatiner3 ubuntu /bin/bash




-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Port Mapping (requests made on a local/host port are forwarded to a port on the container)

docker run -td --name techserver -p 80:80 ubuntu

#The command creates and starts a new Docker container called techserver in the background, using the official ubuntu image. It allocates a terminal (-t), runs in detached mode (-d), and maps port 80 on the host to port 80 inside the container (-p 80:80). This setup is typically used to run a web server or another service that listens on port 80.



example

apt-get install apache2 -y
cd /var/www/html
echo "subscribe indra technical"&amp;gt;index.html
service apache2 restart 

# visit the public IP on port 80 and you will get the output "subscribe indra technical". Note: an inbound rule must allow port 80.




docker run -td --name myjenkins -p 8080:8080 jenkins/jenkins:lts
# this command runs Jenkins in a detached Docker container, names it myjenkins, and makes Jenkins accessible via port 8080 on your machine.  Note inbound rule should be set for port 8080.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=vWjP3fsfgrw&amp;amp;list=PLoz1vq3JRiWNZBHNOf8uGuXaYTTogQA0t" rel="noopener noreferrer"&gt;Video tutorial links&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Indrasingh1992/Docker_learn" rel="noopener noreferrer"&gt;GitHub code link&lt;/a&gt;&lt;/p&gt;

</description>
      <category>docker</category>
    </item>
    <item>
      <title>Basic Python</title>
      <dc:creator>Indrasen</dc:creator>
      <pubDate>Sun, 15 Sep 2024 19:14:03 +0000</pubDate>
      <link>https://dev.to/indrasen_9d014cf224a46c4a/basic-python-38fj</link>
      <guid>https://dev.to/indrasen_9d014cf224a46c4a/basic-python-38fj</guid>
      <description>&lt;p&gt;Python are used to store and reference various types of data, such as human nouns that refer to people, places, or things. Python has five main data types: numbers, strings, lists, and dictionaries (such as so-called called dicts) and Booleans, these data types are similar in many programming languages. Although they may have different names (for example, lists in Python are called arrays in JavaScript), prominent punctuation and symbols for each data type in Python make it easier to identify. And syntax highlighting in code editors also helps differentiate them.&lt;/p&gt;

&lt;p&gt;example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 2-1. parts_of_speech.py
# a number is just digits
25
# a string is anything surrounded by matching quotation marks
"Hello World"
# a list is surrounded by square brackets, with commas between items

# note that in Python, the first item in a list is considered to be
# in position `0`, the next in position `1`, and so on
["this","is",1,"list"]
# a dict is a set of key:value pairs, separated by commas and surrounded
# by curly braces
{"title":"Practical Python for Data Wrangling and Data Quality",
 "format": "book",
 "author": "Susan E. McGregor"
}
# a boolean is a data type that has only two values, True and False.
True

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Naming a Python variable
&lt;/h3&gt;

&lt;p&gt;Example 2-2. Naming a Python variable&lt;br&gt;
&lt;code&gt;author = "Susan E. McGregor"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This code tells the computer to set aside a box in memory, label it author, and then&lt;br&gt;
put the string "Susan E. McGregor" into that box. Later on in our program, if we&lt;br&gt;
asked the computer about the author variable, it would tell us that it contains the&lt;br&gt;
string "Susan E. McGregor"&lt;/p&gt;

&lt;p&gt;Example 2-3. Printing the contents of a Python variable&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create a variable named author, set its contents to "Susan E. McGregor"
author = "Susan E. McGregor"
# confirm that the computer "remembers" what's in the `author` variable
print(author)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verbs ≈ Functions
&lt;/h3&gt;

&lt;p&gt;Functions and methods in Python are like verbs in a language. Functions such as print() are built into the language and available everywhere, while methods are functions associated with a specific data type; for example, the split() method works with strings to do things like splitting text. Built-in functions are simple to use: you call them with the required arguments in parentheses. Methods, however, must be attached to a specific variable or literal of the relevant data type, because each method is designed to work with that type (e.g., strings have methods like split() but numbers don't).&lt;/p&gt;

&lt;p&gt;In the case of the split() method, however, we have to attach the method to a specific&lt;br&gt;
string. That string can either be a literal (that is, a series of characters surrounded&lt;br&gt;
by quotation marks), or it can be a variable whose value is a string. Try the code in&lt;br&gt;
Example 2-5 in a standalone file or notebook, and see what kind of output you get!&lt;/p&gt;
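&lt;p&gt;As a quick illustration of attaching split() to both a literal and a variable (the sample strings here are illustrative):&lt;br&gt;&lt;/p&gt;

```python
# split() is a string method, so it must be attached to a string value
print("Hello World".split())      # ['Hello', 'World']  (splits on whitespace by default)

title = "Practical Python for Data Wrangling"
words = title.split()
print(words)                      # ['Practical', 'Python', 'for', 'Data', 'Wrangling']

print("2024-09-15".split("-"))    # ['2024', '09', '15']  (splitting on a custom separator)
```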

&lt;p&gt;refer example in git hub provided link:&lt;br&gt;
&lt;a href="https://github.com/Indrasingh1992/Data-Wrangling.git" rel="noopener noreferrer"&gt;example 2-5 &amp;amp; 2-6 in ipynb file&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Python, a user-defined function is a function created by the user to perform specific tasks. Functions allow for code reuse and make programs more modular and organized.&lt;br&gt;
&lt;a href="https://github.com/Indrasingh1992/Data-Wrangling.git" rel="noopener noreferrer"&gt;example 2-7 in ipynb file&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


def greet_me(a_name):
 print("Hello "+a_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;The def keyword is used to define a function.&lt;/li&gt;
&lt;li&gt;Parentheses after the function name indicate that it is a function and hold any parameters.&lt;/li&gt;
&lt;li&gt;The colon (:) marks the beginning of the indented function body.&lt;/li&gt;
&lt;li&gt;To access the arguments passed to the function, use the local parameter name given in the parentheses.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Loops
&lt;/h3&gt;

&lt;p&gt;A loop in Python is a flow control statement that can execute code repeatedly based on a condition or sequence of elements. There are two main types of loops in Python: for loops and while loops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Loops:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;for Loop:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Used to iterate over a sequence (like a list, string, or range) and execute a block of code for each item in the sequence.&lt;/li&gt;
&lt;li&gt;It automatically moves to the next item in the sequence after executing the code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;while Loop:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repeats a block of code as long as a specified condition is True.&lt;/li&gt;
&lt;li&gt;It continues looping until the condition becomes False&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Loop Control Statements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;break: Terminates the loop prematurely.&lt;/li&gt;
&lt;li&gt;continue: Skips the current iteration and moves to the next one.&lt;/li&gt;
&lt;li&gt;pass: Does nothing, used as a placeholder for future code.&lt;/li&gt;
&lt;/ul&gt;
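&lt;p&gt;A minimal sketch tying the two loop types and the control statements together (the numbers are arbitrary illustrations):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# for loop over a sequence, with break and continue
for i in range(5):
    if i == 3:
        break        # stop the loop early
    if i == 1:
        continue     # skip this iteration
    print(i)         # prints 0, then 2

# while loop: repeats until the condition becomes False
count = 0
while count != 3:
    count = count + 1
print(count)         # prints 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;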

&lt;p&gt;&lt;a href="https://github.com/Indrasingh1992/Data-Wrangling.git" rel="noopener noreferrer"&gt;example 2-8 to 2-10 in ipynb file&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  If-else conditions
&lt;/h4&gt;

&lt;p&gt;You can use if-else conditions inside loops in Python to perform different actions based on conditions during each iteration of the loop. Both for and while loops can be combined with if-else for control flow.&lt;/p&gt;
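&lt;p&gt;For example, an if-else inside a for loop can branch on each item (a small sketch; the list values are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for number in [1, 2, 3, 4]:
    if number % 2 == 0:
        print(str(number) + " is even")
    else:
        print(str(number) + " is odd")
# prints: 1 is odd, 2 is even, 3 is odd, 4 is even
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;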

&lt;p&gt;&lt;a href="https://github.com/Indrasingh1992/Data-Wrangling.git" rel="noopener noreferrer"&gt;example 2-11 &amp;amp; 2-12 in ipynb file&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>pandas</category>
    </item>
    <item>
      <title>Python Data Wrangling and Data Quality</title>
      <dc:creator>Indrasen</dc:creator>
      <pubDate>Sun, 15 Sep 2024 14:53:45 +0000</pubDate>
      <link>https://dev.to/indrasen_9d014cf224a46c4a/python-data-wrangling-and-data-quality-15jl</link>
      <guid>https://dev.to/indrasen_9d014cf224a46c4a/python-data-wrangling-and-data-quality-15jl</guid>
      <description>&lt;h2&gt;
  
  
  What is Data Wrangling and Data Quality and why it's important?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Wrangling
&lt;/h3&gt;

&lt;p&gt;Data wrangling is the process of transforming raw or received data into a format that can be analyzed to create insights. Because most available data is not of high quality, this involves making judgments about data quality along the way. The process is more than programming and data manipulation: the decisions and selections made shape the final dataset.&lt;/p&gt;

&lt;p&gt;Key steps in the data wrangling process include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Locating or collecting the data&lt;/li&gt;
&lt;li&gt;Reviewing the data&lt;/li&gt;
&lt;li&gt;Cleaning, standardizing, correcting, and updating the data&lt;/li&gt;
&lt;li&gt;Analyzing the data&lt;/li&gt;
&lt;li&gt;Presenting or visualizing the data&lt;/li&gt;
&lt;/ul&gt;
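&lt;p&gt;The cleaning and standardizing step can be sketched in plain Python. The records and the normalization rules here are made up for illustration; real wrangling work depends on the dataset at hand:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical raw records: inconsistent case, stray whitespace, a missing value
raw = ["  New York ", "new york", "CHICAGO", "", "Chicago "]

# Clean: drop empty values; standardize: strip whitespace and normalize case
cleaned = []
for value in raw:
    value = value.strip()
    if value:
        cleaned.append(value.title())

print(cleaned)  # prints: ['New York', 'New York', 'Chicago', 'Chicago']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;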

&lt;h3&gt;
  
  
  Data quality
&lt;/h3&gt;

&lt;p&gt;Data quality refers to the reliability and accuracy of data, which is critical to gaining meaningful insights. Not all data is of the same quality, and poor-quality data leads to flawed conclusions. Assessing data quality is therefore an essential part of data wrangling.&lt;/p&gt;

&lt;p&gt;Although computers are powerful, they only follow human instructions and are limited to matching patterns in the data they are given. Humans play the key role in data collection, analysis, and quality assurance, because computers cannot make creative judgments or understand context.&lt;/p&gt;

&lt;p&gt;Data quality assessment has two main dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data integrity:&lt;/strong&gt; how accurate and reliable the data is.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fit for purpose:&lt;/strong&gt; whether the data is appropriate for the specific question or problem being solved.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What is Data integrity?
&lt;/h3&gt;

&lt;p&gt;Data integrity refers to the quality and reliability of the data values and descriptors in a dataset. When evaluating integrity, consider whether measurements were taken at regular intervals, whether values represent individual readings or averages, and whether a data dictionary explains how the data should be stored or interpreted (for example, the relevant units).&lt;/p&gt;

&lt;h3&gt;
  
  
  What is data fit?
&lt;/h3&gt;

&lt;p&gt;Data "fit" refers to how well a dataset fits a specific purpose or query. Although the dataset is highly complete, But if it does not meet the needs of the analysis It may not be useful, for example real-time Citi Bike data may be of good quality. But it's not suitable for answering questions about how bike stations change from day to day. Citi Bike travel history information would be more appropriate...&lt;/p&gt;

&lt;p&gt;Determining whether data is fit for purpose often requires assessing its integrity as well, and shortcuts in this process can undermine the quality of the analysis and lead to incorrect conclusions. Fit problems, such as using income data to answer questions about education, can distort findings and produce misleading results. Proxy measures are sometimes necessary, especially in urgent situations, but relying on them at scale can amplify errors and distort the real-world phenomena the data is meant to describe.&lt;/p&gt;

&lt;p&gt;Carefully assessing both the integrity and the fit of the data helps prevent these errors.&lt;/p&gt;

&lt;p&gt;High-integrity data is complete, atomic, and well annotated, which allows for more detailed analysis. Many datasets lack these features, however, and it falls to analysts to understand and work around their limitations, often by seeking additional information or consulting experts familiar with the dataset or field of study.&lt;/p&gt;

&lt;p&gt;For the worked examples, see the PDF and ipynb files in this repository: &lt;a href="https://github.com/Indrasingh1992/Data-Wrangling.git" rel="noopener noreferrer"&gt;github&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>pandas</category>
      <category>numpy</category>
    </item>
  </channel>
</rss>
