<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lena Hall</title>
    <description>The latest articles on DEV Community by Lena Hall (@lenadroid).</description>
    <link>https://dev.to/lenadroid</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F114791%2F5518f738-5fad-4045-b189-9f316d5e40b0.jpg</url>
      <title>DEV Community: Lena Hall</title>
      <link>https://dev.to/lenadroid</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lenadroid"/>
    <language>en</language>
    <item>
      <title>Unleash the Potential of PowerApps with BigData</title>
      <dc:creator>Lena Hall</dc:creator>
      <pubDate>Thu, 09 Jul 2020 17:06:41 +0000</pubDate>
      <link>https://dev.to/azure/unleash-the-potential-of-powerapps-with-bigdata-eog</link>
      <guid>https://dev.to/azure/unleash-the-potential-of-powerapps-with-bigdata-eog</guid>
      <description> 
&lt;h4&gt;
  
  
  #PowerfulDevs Conference
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;RSVP HERE BY JULY 15:&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://aka.ms/powerfuldevsconf" rel="noopener noreferrer"&gt;https://aka.ms/powerfuldevsconf&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The &lt;a href="https://aka.ms/powerfuldevsconf" rel="noopener noreferrer"&gt;Powerful DEVs Conference&lt;/a&gt; is the first virtual conference of its kind. We will showcase how developers can leverage the Power Platform to build applications faster and with far less effort. Connect with industry-recognized ProDev influencers, Microsoft Cloud Advocates, trusted and diverse community leaders, and members of the Power Platform Team. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;BOOKMARK THIS FOR RESOURCES AND DISCUSSION WITH SPEAKERS&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag__link"&gt;
  &lt;a href="/azure" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F512%2F64ce0b82-730d-4ca0-8359-2c21513a0063.jpg" alt="Microsoft Azure"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F422403%2F5e52a938-ac03-4919-afca-d684c4a3e39a.jpg" alt=""&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/azure/powerfuldevs-conference-join-us-on-july-15th-online-fe3" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;#powerfuldevs Conference: Join us on July 15th Online!&lt;/h2&gt;
      &lt;h3&gt;JennyMEvents for Microsoft Azure ・ Jul 10 '20&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#powerfuldevs&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#powerplatform&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#prodev&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#programming&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;
 

&lt;p&gt;Revisit this page during the event to engage in live (and post-event) discussions on those topics with both the speakers and the community. The speakers will be here for a live Q&amp;amp;A for at least 30 minutes immediately after their session concludes. After the event, come back to find additional slides, videos, and resources for this session.&lt;/p&gt;


 

&lt;h2&gt;
  
  
  About This Session:
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;July 15, 2020: 15:00 PDT - 15:25 PDT&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Learn how to extract value from your data to bring the impact of your low-code solutions to a whole new level. PowerApps already enables the creation of useful business applications with minimal effort. In this session, you will learn how and why to connect your applications to the Azure services responsible for Big Data. You will see an example of an application that keeps track of NYC taxi logs and provides logistical information for greater business insights. You will leave this session with a confident understanding of what Big Data connection options PowerApps provides, how to connect your application to Big Data, and how to reference and visualize it.&lt;/p&gt;

&lt;h2&gt;
  
  
  About the Speakers:
&lt;/h2&gt;

&lt;p&gt;Lena Hall is a Principal Software Engineer @Microsoft and the Team Lead for Big Data Services. Follow &lt;a href="https://twitter.com/lenadroid" rel="noopener noreferrer"&gt;@lenadroid&lt;/a&gt; on Twitter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional Resources
&lt;/h2&gt;

&lt;p&gt;Dev'ing for Power Apps session&lt;/p&gt;

&lt;p&gt;DevTo page:  &lt;a href="https://aka.ms/PowerAppsDevs" rel="noopener noreferrer"&gt;https://aka.ms/PowerAppsDevs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ama</category>
      <category>powerfuldevs</category>
      <category>azure</category>
      <category>powerapps</category>
    </item>
    <item>
      <title>Applied Cloud Stories: Winning Entries</title>
      <dc:creator>Lena Hall</dc:creator>
      <pubDate>Tue, 16 Jun 2020 18:45:38 +0000</pubDate>
      <link>https://dev.to/azure/applied-cloud-stories-winning-entries-3n66</link>
      <guid>https://dev.to/azure/applied-cloud-stories-winning-entries-3n66</guid>
      <description>&lt;p&gt;In January this year, we launched the &lt;a href="https://aka.ms/applied-cloud-stories"&gt;Applied Cloud Stories&lt;/a&gt; initiative: a call for new content created by independent community members, focusing on practical stories about scenarios and workloads that can run on Azure.&lt;/p&gt;

&lt;p&gt;Over the last couple of months, we were fortunate to receive a number of outstanding community stories. Many of them shared lessons learned, trade-offs, tips and tricks, and valuable experience. We are grateful for every single story we received from you! &lt;/p&gt;

&lt;p&gt;We are absolutely delighted to share the winners of the &lt;a href="https://aka.ms/applied-cloud-stories"&gt;Applied Cloud Stories&lt;/a&gt; initiative!&lt;/p&gt;

&lt;h2&gt;
  
  
  ✨ Machine Learning and Data Science ✨
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/pulse/optimize-azure-ml-iot-production-usage-reducing-data-cost-logvinskiy/"&gt;"Optimize Azure ML IoT for Production Usage"&lt;/a&gt; by &lt;a href="https://uk.linkedin.com/in/vlogvinskiy"&gt;Valentyn Logvinskiy&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Over the last few years IoT devices and ML/AI have become very popular, and now a lot of companies are moving forward to use them in production. All cloud providers, including Microsoft Azure, provides services how to deploy developed machine learning algorithms to the edge device. The main concern of some industries (automotive, agriculture, etc.) is that in production the cost for data transfer, out of the total cost of ownership, will be huge."&lt;/p&gt;

&lt;p&gt;"Let's take a look at how Azure ML IoT works and when reducing the data transfer matters."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yOo3aRK5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://media-exp1.licdn.com/dms/image/C4D12AQF24QLFEhje5A/article-cover_image-shrink_600_2000/0%3Fe%3D1597881600%26v%3Dbeta%26t%3DpxVjyihhBuekIhpnYej1RheN-K8Sn1b0bEi5TgGvnVo" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yOo3aRK5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://media-exp1.licdn.com/dms/image/C4D12AQF24QLFEhje5A/article-cover_image-shrink_600_2000/0%3Fe%3D1597881600%26v%3Dbeta%26t%3DpxVjyihhBuekIhpnYej1RheN-K8Sn1b0bEi5TgGvnVo" alt="machine learning diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  💬 Reviewer's Note 💬
&lt;/h3&gt;

&lt;p&gt;The reviewers highlighted that the author does superb work using different Azure technologies to solve a real technical challenge and business problem. His demonstration of how to reduce the size of the Docker image layers that need to be transferred to the IoT device is unique and has a visible business impact: the cost of updating the model in production drops dramatically, and other developers and data scientists can apply this lesson to operationalize their IoT solutions while keeping costs low.&lt;/p&gt;

&lt;h2&gt;
  
  
  ✨ DevOps and Infrastructure ✨
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://4bes.nl/2020/04/02/deploy-a-test-environment-with-a-calendar-appointment/"&gt;Deploy a Test Environment With a Calendar Appointment&lt;/a&gt; by &lt;a href="https://twitter.com/ba4bes"&gt;Barbara Forbes&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How to find an automated and easy way to create a non-production environment?"&lt;/p&gt;

&lt;p&gt;"In most cases, the ongoing situation is that developers are sharing VMs with the needed applications, or they use their own workstation. While this works, it is far from ideal. I found a solution, using PowerShell, ARM Templates and Azure Serverless services. In this post I want to talk about how to deploy a test environment with a calendar appointment."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FBFXJTD1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://4bes.nl/wp-content/uploads/2020/03/diagram02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FBFXJTD1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://4bes.nl/wp-content/uploads/2020/03/diagram02.png" alt="devops diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  💬 Reviewer's Note 💬
&lt;/h3&gt;

&lt;p&gt;Our reviewers found the idea and the concept described in the article interesting and original! They appreciated that the author shows, in technical detail, how to use "unexpected" API integrations to trigger various cloud actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  ✨ Applications ✨
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.cloudskew.com/about/cloudskew-architecture.html"&gt;CloudSkew Architecture&lt;/a&gt; by &lt;a href="https://twitter.com/MithunShanbhag"&gt;Mithun Shanbhag&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"CloudSkew's infrastructure has been built on top of various Azure services - snapped together like lego blocks."&lt;/p&gt;

&lt;p&gt;"Deep-dive on CloudSkew's building blocks discussing the lessons learnt, key decisions &amp;amp; trade offs made."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7NyCdXXv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.cloudskew.com/assets/img/cloudskew-architecture.7f1edd8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7NyCdXXv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.cloudskew.com/assets/img/cloudskew-architecture.7f1edd8b.png" alt="applications architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  💬 Reviewer's Note 💬
&lt;/h3&gt;

&lt;p&gt;The reviewers noted the wide spectrum of topics and technologies covered in this amazing article, including monitoring, incident management, manual approvals, and much more. The author also focused on technology choices and trade-offs, such as PaaS vs. Kubernetes.&lt;/p&gt;

&lt;h2&gt;
  
  
  ✨ Research ✨
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/PascalS86/Docs/tree/master/AzureCloudStories"&gt;Azure Notebooks And Cognitive Services Within An University Class&lt;/a&gt; by &lt;a href="https://twitter.com/PaSe_Lp"&gt;Pascal&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"This story shows how we used &lt;a href="https://notebooks.azure.com/"&gt;Azure Notebooks&lt;/a&gt; for providing an interactive learning experience in class."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CkVgOJGd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/PascalS86/Docs/raw/master/AzureCloudStories/images/SampleSolution.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CkVgOJGd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/PascalS86/Docs/raw/master/AzureCloudStories/images/SampleSolution.PNG" alt="research diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  💬 Reviewer's Note 💬
&lt;/h3&gt;

&lt;p&gt;The reviewers agreed the story is a nice use case for academic audiences. It shows how easy it is to use Azure to teach students.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Congratulations to the authors of the winning stories!&lt;/p&gt;

&lt;p&gt;Over the next few days, we will be reaching out to the winners and to the authors of submissions that the Applied Cloud Stories Committee chose to feature.&lt;/p&gt;

&lt;p&gt;We look forward to publishing the winning and featured stories on Microsoft content properties and sharing them with you very soon! We are incredibly grateful for all the amazing community stories.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Large-scale Data Analytics with Azure Synapse - Workspaces with CLI</title>
      <dc:creator>Lena Hall</dc:creator>
      <pubDate>Wed, 20 May 2020 17:20:09 +0000</pubDate>
      <link>https://dev.to/azure/large-scale-data-analytics-with-azure-synapse-workspaces-with-cli-3g99</link>
      <guid>https://dev.to/azure/large-scale-data-analytics-with-azure-synapse-workspaces-with-cli-3g99</guid>
      <description>&lt;p&gt;One of the challenges of large-scale data analysis is extracting value from data with the least effort. Doing that often involves multiple stages: provisioning infrastructure, accessing or moving data, transforming or filtering data, analyzing and learning from data, automating the data pipelines, connecting with other services that provide input or consume the output data, and more. There are quite a few tools available to address these needs, but it's usually difficult to have them all in one place and easily connected.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If this article was helpful or interesting to you, follow &lt;a href="https://twitter.com/lenadroid" rel="noopener noreferrer"&gt;@lenadroid&lt;/a&gt; on Twitter. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This is the first article in this series, which will cover what Azure Synapse is and how to start using it with the Azure CLI. Make sure your &lt;a href="https://docs.microsoft.com/en-us/cli/azure?WT.mc_id=synapse-blog-alehall" rel="noopener noreferrer"&gt;Azure CLI&lt;/a&gt; is installed and up-to-date, and add the &lt;code&gt;synapse&lt;/code&gt; extension if necessary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;az extension add &lt;span class="nt"&gt;--name&lt;/span&gt; synapse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What is Azure Synapse?&lt;/strong&gt;&lt;br&gt;
In Azure, we have the &lt;a href="https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-overview-what-is?WT.mc_id=synapse-blog-alehall" rel="noopener noreferrer"&gt;Synapse Analytics&lt;/a&gt; service, which aims to provide managed support for distributed data analysis workloads with less friction. If you're coming from a GCP or AWS background, the closest alternatives in those clouds are products like BigQuery or Redshift. Azure Synapse is currently in public preview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Serverless and provisioned capacity&lt;/strong&gt;&lt;br&gt;
In the world of large-scale data processing and analytics, features like autoscaling clusters and pay-for-what-you-use pricing have become must-haves. In Azure Synapse, you can choose between &lt;a href="https://docs.microsoft.com/en-us/azure/synapse-analytics/overview-what-is?WT.mc_id=synapse-blog-alehall" rel="noopener noreferrer"&gt;serverless and provisioned&lt;/a&gt; capacity, depending on whether you need to be flexible and adjust to bursts, or have a predictable resource load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native Apache Spark support&lt;/strong&gt;&lt;br&gt;
Apache Spark has demonstrated its power in data processing for both batch and real-time streaming models. It offers great Python and Scala/Java support for data operations at large scale. Azure Synapse provides &lt;a href="https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-apache-spark-notebook?WT.mc_id=synapse-blog-alehall" rel="noopener noreferrer"&gt;built-in support&lt;/a&gt; for data analytics using Apache Spark. It's possible to create an Apache Spark pool, upload Spark jobs, or create Spark notebooks for experimenting with the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQL support&lt;/strong&gt;&lt;br&gt;
In addition to Apache Spark support, Azure Synapse has excellent support for data analytics with &lt;a href="https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-sql-pool-portal?WT.mc_id=synapse-blog-alehall" rel="noopener noreferrer"&gt;SQL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Other features&lt;/strong&gt;&lt;br&gt;
Azure Synapse provides smooth integration with Azure Machine Learning and Spark ML. It enables convenient data ingestion and export using Azure Data Factory, which connects with many Azure and third-party data sources and sinks. Data can be effectively visualized with Power BI.&lt;/p&gt;

&lt;p&gt;At Microsoft Build 2020, Satya Nadella announced &lt;a href="https://azure.microsoft.com/en-us/blog/azure-analytics-clarity-in-an-instant?WT.mc_id=synapse-blog-alehall" rel="noopener noreferrer"&gt;Synapse Link&lt;/a&gt;, functionality that will help get insights from real-time transactional data stored in operational databases (e.g. Cosmos DB) with a single click, without the need to manage data movement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get started with Azure Synapse Workspaces using Azure CLI
&lt;/h2&gt;

&lt;p&gt;Prepare the necessary environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ StorageAccountName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;come up with a name for your storage account&amp;gt;'&lt;/span&gt;
&lt;span class="nv"&gt;$ ResourceGroup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;come up with a name for your resource group&amp;gt;'&lt;/span&gt;
&lt;span class="nv"&gt;$ Region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;come up with a name of the region, e.g. eastus&amp;gt;'&lt;/span&gt;
&lt;span class="nv"&gt;$ FileShareName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;come up with a name of the storage file share&amp;gt;'&lt;/span&gt;
&lt;span class="nv"&gt;$ SynapseWorkspaceName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;come up with a name for Synapse Workspace&amp;gt;'&lt;/span&gt;
&lt;span class="nv"&gt;$ SqlUser&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;come up with a username&amp;gt;'&lt;/span&gt;
&lt;span class="nv"&gt;$ SqlPassword&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;come up with a secure password&amp;gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
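Since these values feed directly into the commands that follow, a quick local sanity check can save a failed deployment. The sketch below is illustrative and not part of the original post: the `check_storage_name` helper is hypothetical, and it encodes Azure's storage account naming rule (3-24 characters, lowercase letters and digits only).

```shell
# Hypothetical helper (not from the original post): validate a storage
# account name against Azure's rules before calling
# `az storage account create` -- 3 to 24 chars, lowercase letters/digits.
check_storage_name() {
  case "$1" in
    *[!a-z0-9]*) echo "invalid"; return 1 ;;
  esac
  if [ "${#1}" -ge 3 ] && [ "${#1}" -le 24 ]; then
    echo "valid"
  else
    echo "invalid"
    return 1
  fi
}

# Example: check the name before provisioning anything.
check_storage_name "mysynapsestore01"   # prints "valid"
```

A check like this is cheap insurance, because the storage account name must also be globally unique and a violation only surfaces after the CLI call is made.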



&lt;p&gt;Create a resource group as a container for your resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;az group create &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$ResourceGroup&lt;/span&gt; &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="nv"&gt;$Region&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a Data Lake storage account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;az storage account create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$StorageAccountName&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; &lt;span class="nv"&gt;$ResourceGroup&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="nv"&gt;$Region&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sku&lt;/span&gt; Standard_GRS &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--kind&lt;/span&gt; StorageV2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
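One hedged aside, based on Azure CLI documentation rather than the original post: a Data Lake Storage Gen2 account specifically requires the hierarchical namespace (note `"isHnsEnabled": null` in the command output). If you need Gen2 features, the flag can be added at creation time:

```shell
# Variant of the command above (assumption based on Azure CLI docs):
# --enable-hierarchical-namespace makes the StorageV2 account a
# Data Lake Storage Gen2 account ("isHnsEnabled": true in the output).
az storage account create \
  --name $StorageAccountName \
  --resource-group $ResourceGroup \
  --location $Region \
  --sku Standard_GRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true
```
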



&lt;p&gt;The output of this command will be similar to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Finished&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;..&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"accessTier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"creationTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-05-19T01:32:42.434045+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"customDomain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"enableAzureFilesAadIntegration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"enableHttpsTrafficOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"encryption"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"keySource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Microsoft.Storage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"keyVaultProperties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"services"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"blob"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"lastEnabledTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-05-19T01:32:42.496550+00:00"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"lastEnabledTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-05-19T01:32:42.496550+00:00"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"queue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"failoverInProgress"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"geoReplicationStats"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/subscriptions/&amp;lt;subscription-id&amp;gt;/resourceGroups/Synapse-test/providers/Microsoft.Storage/storageAccounts/&amp;lt;storage-account-name&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"identity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"isHnsEnabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"StorageV2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lastGeoFailoverTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eastus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;storage-account-name&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"networkRuleSet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bypass"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AzureServices"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"defaultAction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ipRules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"virtualNetworkRules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"primaryEndpoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"blob"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;storage-account-name&amp;gt;.blob.core.windows.net/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dfs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;storage-account-name&amp;gt;.dfs.core.windows.net/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;storage-account-name&amp;gt;.file.core.windows.net/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"queue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;storage-account-name&amp;gt;.queue.core.windows.net/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;storage-account-name&amp;gt;.table.core.windows.net/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"web"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;storage-account-name&amp;gt;.z13.web.core.windows.net/"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"primaryLocation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eastus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provisioningState"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Succeeded"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resourceGroup"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;resource-group-name&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"secondaryEndpoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"secondaryLocation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"westus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sku"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"capabilities"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"locations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Standard_GRS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"resourceType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"restrictions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Standard"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"statusOfPrimary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"available"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"statusOfSecondary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"available"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Microsoft.Storage/storageAccounts"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retrieve the storage account key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ StorageAccountKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az storage account keys list &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--account-name&lt;/span&gt; &lt;span class="nv"&gt;$StorageAccountName&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.[0] | .value'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retrieve the storage endpoint URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ StorageEndpointUrl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;az storage account show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$StorageAccountName&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; &lt;span class="nv"&gt;$ResourceGroup&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.primaryEndpoints | .dfs'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can always print the storage account key and endpoint to double-check them, if you'd like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Storage Account Key: &lt;/span&gt;&lt;span class="nv"&gt;$StorageAccountKey&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Storage Endpoint URL: &lt;/span&gt;&lt;span class="nv"&gt;$StorageEndpointUrl&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a fileshare:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;az storage share create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--account-name&lt;/span&gt; &lt;span class="nv"&gt;$StorageAccountName&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--account-key&lt;/span&gt; &lt;span class="nv"&gt;$StorageAccountKey&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$FileShareName&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a Synapse Workspace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;az synapse workspace create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$SynapseWorkspaceName&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; &lt;span class="nv"&gt;$ResourceGroup&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--storage-account&lt;/span&gt; &lt;span class="nv"&gt;$StorageAccountName&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--file-system&lt;/span&gt; &lt;span class="nv"&gt;$FileShareName&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sql-admin-login-user&lt;/span&gt; &lt;span class="nv"&gt;$SqlUser&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sql-admin-login-password&lt;/span&gt; &lt;span class="nv"&gt;$SqlPassword&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="nv"&gt;$Region&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the command should show the successful creation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;- Finished ..
  &lt;span class="s2"&gt;"connectivityEndpoints"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"dev"&lt;/span&gt;: &lt;span class="s2"&gt;"https://&amp;lt;synapse-workspace-name&amp;gt;.dev.azuresynapse.net"&lt;/span&gt;,
    &lt;span class="s2"&gt;"sql"&lt;/span&gt;: &lt;span class="s2"&gt;"&amp;lt;synapse-workspace-name&amp;gt;.sql.azuresynapse.net"&lt;/span&gt;,
    &lt;span class="s2"&gt;"sqlOnDemand"&lt;/span&gt;: &lt;span class="s2"&gt;"&amp;lt;synapse-workspace-name&amp;gt;-ondemand.sql.azuresynapse.net"&lt;/span&gt;,
    &lt;span class="s2"&gt;"web"&lt;/span&gt;: &lt;span class="s2"&gt;"https://web.azuresynapse.net?workspace=%2fsubscriptions%&amp;lt;subscription-id&amp;gt;%2fresourceGroups%2fS&amp;lt;resource-group-name&amp;gt;%2fproviders%2fMicrosoft.Synapse%2fworkspaces%&amp;lt;synapse-workspace-name&amp;gt;"&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"defaultDataLakeStorage"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"accountUrl"&lt;/span&gt;: &lt;span class="s2"&gt;"https://&amp;lt;storage-account-name&amp;gt;.dfs.core.windows.net"&lt;/span&gt;,
    &lt;span class="s2"&gt;"filesystem"&lt;/span&gt;: &lt;span class="s2"&gt;"&amp;lt;file-share-name&amp;gt;"&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"id"&lt;/span&gt;: &lt;span class="s2"&gt;"/subscriptions/&amp;lt;subscription-id&amp;gt;/resourceGroups/&amp;lt;resource-group-name&amp;gt;/providers/Microsoft.Synapse/workspaces/&amp;lt;synapse-workspace-name&amp;gt;"&lt;/span&gt;,
  &lt;span class="s2"&gt;"identity"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"principalId"&lt;/span&gt;: &lt;span class="s2"&gt;"&amp;lt;principal-id&amp;gt;"&lt;/span&gt;,
    &lt;span class="s2"&gt;"tenantId"&lt;/span&gt;: &lt;span class="s2"&gt;"&amp;lt;tenant-id&amp;gt;"&lt;/span&gt;,
    &lt;span class="s2"&gt;"type"&lt;/span&gt;: &lt;span class="s2"&gt;"SystemAssigned"&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"location"&lt;/span&gt;: &lt;span class="s2"&gt;"eastus"&lt;/span&gt;,
  &lt;span class="s2"&gt;"managedResourceGroupName"&lt;/span&gt;: &lt;span class="s2"&gt;"&amp;lt;managed-tesource-group-id&amp;gt;"&lt;/span&gt;,
  &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"&amp;lt;synapse-workspace-name&amp;gt;"&lt;/span&gt;,
  &lt;span class="s2"&gt;"provisioningState"&lt;/span&gt;: &lt;span class="s2"&gt;"Succeeded"&lt;/span&gt;,
  &lt;span class="s2"&gt;"resourceGroup"&lt;/span&gt;: &lt;span class="s2"&gt;"&amp;lt;resource-group-name&amp;gt;"&lt;/span&gt;,
  &lt;span class="s2"&gt;"sqlAdministratorLogin"&lt;/span&gt;: &lt;span class="s2"&gt;"&amp;lt;admin-login&amp;gt;"&lt;/span&gt;,
  &lt;span class="s2"&gt;"sqlAdministratorLoginPassword"&lt;/span&gt;: &amp;lt;admin-password&amp;gt;,
  &lt;span class="s2"&gt;"tags"&lt;/span&gt;: null,
  &lt;span class="s2"&gt;"type"&lt;/span&gt;: &lt;span class="s2"&gt;"Microsoft.Synapse/workspaces"&lt;/span&gt;,
  &lt;span class="s2"&gt;"virtualNetworkProfile"&lt;/span&gt;: null
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you have successfully created these resources, you should be able to go to the Azure Portal and navigate to the resource called &lt;code&gt;$SynapseWorkspaceName&lt;/code&gt; in the &lt;code&gt;$ResourceGroup&lt;/code&gt; resource group. You should see a page similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx43czkxwz9kfjhgt7n3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx43czkxwz9kfjhgt7n3b.png" alt="Synapse Workspaces Page in Azure Portal"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's next?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can now load data and experiment with it in Synapse Studio, create Spark or SQL pools and run analytics queries, connect to Power BI and visualize your data, and much more.&lt;/p&gt;
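&lt;p&gt;For example, a Spark pool can be created from the same shell session. The following is a sketch with assumed values: the pool name, node size, node count, and Spark version are illustrative, not values from this article.&lt;/p&gt;

```shell
# Sketch: create a small Spark pool in the Synapse workspace,
# reusing the variables defined earlier. Adjust the (assumed)
# pool name, node size/count, and Spark version for your workload.
az synapse spark pool create \
  --name samplesparkpool \
  --workspace-name $SynapseWorkspaceName \
  --resource-group $ResourceGroup \
  --spark-version 2.4 \
  --node-count 3 \
  --node-size Small
```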

&lt;p&gt;Stay tuned for the next articles in this series to learn more. Thanks for reading!&lt;/p&gt;

&lt;p&gt;If this article was interesting to you, follow &lt;a href="https://twitter.com/lenadroid" rel="noopener noreferrer"&gt;@lenadroid&lt;/a&gt; on Twitter. &lt;/p&gt;

</description>
      <category>cloud</category>
      <category>azure</category>
      <category>computerscience</category>
      <category>data</category>
    </item>
    <item>
      <title>Apache Kafka Applications Can Work Without Apache Kafka Cluster?</title>
      <dc:creator>Lena Hall</dc:creator>
      <pubDate>Fri, 03 Apr 2020 16:00:23 +0000</pubDate>
      <link>https://dev.to/azure/apache-kafka-applications-can-work-without-apache-kafka-cluster-3en5</link>
      <guid>https://dev.to/azure/apache-kafka-applications-can-work-without-apache-kafka-cluster-3en5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;If this article was interesting to you, follow &lt;a href="https://twitter.com/lenadroid" rel="noopener noreferrer"&gt;@lenadroid&lt;/a&gt; on Twitter. If you prefer video format better, take a look at &lt;a href="https://aka.ms/azfr/623/yt" rel="noopener noreferrer"&gt;this video&lt;/a&gt; about the topic. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; and &lt;a href="https://azure.microsoft.com/en-us/services/event-hubs?WT.mc_id=kafkaeventhubs-blog-alehall" rel="noopener noreferrer"&gt;Azure Event Hubs&lt;/a&gt; are two different systems for managing events, that have the same goal in mind. Their aim is to provide distributed, reliable, fault-tolerant, persistent, scalable, and fast system for managing events, decoupling event publishers and subscribers, making it easier to build event-driven architectures.&lt;/p&gt;

&lt;p&gt;Many projects already rely on Apache Kafka for event ingestion because it has the richest ecosystem around it: many contributors and a wide variety of open-source libraries, connectors, and projects.&lt;/p&gt;

&lt;p&gt;Apache Kafka can run anywhere - in the cloud and on-premises. For example, one can run it on Azure using &lt;a href="https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-introduction?WT.mc_id=kafkaeventhubs-blog-alehall" rel="noopener noreferrer"&gt;HDInsight for Apache Kafka&lt;/a&gt;, or deploy it on standard VMs.&lt;/p&gt;

&lt;p&gt;One thing we always have to keep in mind is that there is infrastructure behind Apache Kafka that we have to maintain. Apache Kafka runs on a cluster of broker VMs, which we need to manage. Sometimes we want to spend the least amount of time managing infrastructure but still have a reliable backend for event ingestion. This is exactly why someone might want to look at Event Hubs for Apache Kafka ecosystems. You can keep your existing Apache Kafka applications unchanged and rely on Azure Event Hubs as the backend for your event ingestion just by swapping the connection information. This lets you keep using Apache Kafka connectors and libraries from hundreds of projects while delegating the complexity to Azure Event Hubs behind the scenes, so you can focus on code instead of maintaining infrastructure. &lt;/p&gt;

&lt;h2&gt;
  
  
  I'm confused, how can I use Apache Kafka and Event Hubs together? Event Hubs for Apache Kafka? What does it mean?
&lt;/h2&gt;

&lt;p&gt;There are three parts we need to think about:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What is the system we work with on the backend - the one that collects events from producers and distributes it to subscribers?&lt;/strong&gt; This could be Apache Kafka installed on pure VMs in your data center, Apache Kafka running in the cloud, or it can be Event Hubs - a managed service in Azure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What is the client application that works with the backend event-ingestion system?&lt;/strong&gt; This can be an event producer, an event consumer, or a command-line application that connects to the backend event system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How does the client event application talk to the backend event system?&lt;/strong&gt; When the backend system is Apache Kafka, clients can talk to it using Apache Kafka API. When we decide to use Event Hubs, clients can talk to it using the standard Event Hubs API or using Apache Kafka API (thanks to &lt;a href="https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview?WT.mc_id=kafkaeventhubs-blog-alehall" rel="noopener noreferrer"&gt;Event Hubs for Kafka ecosystems&lt;/a&gt; feature).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This way, if you are already working with Apache Kafka, it is easy to simplify the management of your event infrastructure: keep your existing Apache Kafka applications unchanged, swap the connection information to point at Azure Event Hubs, and keep using the Kafka connectors and libraries you already rely on while Event Hubs handles the infrastructure behind the scenes.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real-World Example
&lt;/h2&gt;

&lt;p&gt;Using Apache Kafka for event streaming scenarios is a very common use case. Frequently it is used together with &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt; - an event processing and distributed computing system.&lt;/p&gt;

&lt;p&gt;There are many advantages to using Apache Kafka. One of them is the availability of so many useful libraries and connectors that let you send and receive events to and from a wide variety of sources. The Kafka ecosystem is incredibly rich, and the community is very active.&lt;/p&gt;

&lt;p&gt;Let’s take a look at a common architecture with Apache Kafka. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy12iemi4xgzfiz8k2x8z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy12iemi4xgzfiz8k2x8z.png" alt="Kafka and Spark architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Kafka acts as a data-ingestion component that receives data from some data producer. It can be data sent from sensors or from other applications.&lt;/p&gt;

&lt;p&gt;Apache Spark is a data processing system that receives the data and performs some processing logic on it in real time.&lt;/p&gt;

&lt;p&gt;There is nothing wrong with this architecture, and it's very common. However, it can get complicated to run and manage your own Kafka clusters; managing a Kafka cluster can become a full-time job. To make sure your Kafka cluster operates correctly and performs well, you usually have to tune and configure the virtual machines Kafka runs on, called brokers. When the cluster is scaled up or new topics are added, you need to rebalance partitions. There are many similar tasks you'll need to take care of.&lt;/p&gt;
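&lt;p&gt;As an illustration of that operational work, a manual partition rebalance with Kafka's bundled tooling looks roughly like the sketch below. The topic name, ZooKeeper address, and broker IDs are hypothetical:&lt;/p&gt;

```shell
# Sketch: manual partition rebalancing with Kafka's CLI tools.
# Topic name, ZooKeeper address, and broker IDs are hypothetical.

# Describe which topics to move
cat > topics.json <<EOF
{"topics": [{"topic": "testtopic"}], "version": 1}
EOF

# Generate a candidate reassignment plan across brokers 0, 1, 2
kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --topics-to-move-json-file topics.json \
  --broker-list "0,1,2" --generate

# Save the proposed plan to reassignment.json, then apply it
kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file reassignment.json --execute
```

With a managed service, none of this is your concern; that is the trade-off discussed next.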

&lt;h2&gt;
  
  
  Can We Simplify This?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4ibp5u6q135ttptphzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4ibp5u6q135ttptphzc.png" alt="Event Hubs and Spark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the things you &lt;em&gt;can&lt;/em&gt; do to optimize your architecture is to use a managed service that eliminates the need for cluster maintenance. Event Hubs is a fully managed service in Azure that can ingest millions of events per second and costs 3 cents an hour. It is very similar to Apache Kafka in its goal, with some differences in how they work. For example, with Event Hubs you can use the Auto-Inflate feature to automatically adjust throughput in response to workload spikes, among many other useful features.&lt;/p&gt;

&lt;p&gt;In most cases, nobody wants to rewrite code to move to another service, and this is exactly the case with Event Hubs. Because the Event Hubs protocol is binary-compatible with Apache Kafka, the code you wrote that already works with Apache Kafka will work with Event Hubs as well. This means you can keep using your favorite Apache Kafka libraries, such as the &lt;a href="https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html" rel="noopener noreferrer"&gt;Spark-to-Kafka&lt;/a&gt; connector, use Event Hubs as the backend for event ingestion, and never think about cluster management again. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxth5ed3v1ci4r9xmz3l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxth5ed3v1ci4r9xmz3l.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To start using Event Hubs with your existing Apache Kafka logic, all you need to do is change the configuration to point to Event Hubs instead of Kafka. Where with Apache Kafka we use bootstrap servers to connect, with Event Hubs we use the public URL and a connection string.&lt;/p&gt;
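&lt;p&gt;Concretely, for the Spark-Kafka connector the swap can look like the following sketch. The namespace and connection string are placeholders, and the SASL options follow the Event Hubs for Kafka ecosystems approach; verify them against the official documentation for your setup:&lt;/p&gt;

```scala
// Sketch: pointing the same Spark-Kafka reader at Event Hubs instead of a
// Kafka cluster. <namespace> and <connection-string> are placeholders.
// Event Hubs exposes its Kafka-compatible endpoint on port 9093 over SASL_SSL.
val BOOTSTRAP_SERVERS = "<namespace>.servicebus.windows.net:9093"
val EH_SASL = "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"<connection-string>\";"

val rates = spark.readStream
  .format("kafka")
  .option("subscribe", "testtopic")
  .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config", EH_SASL)
  .load()
```

The rest of the producer and consumer code stays exactly as it is; only these connection options change.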

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8og9qw9h72b80rd1bxi3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8og9qw9h72b80rd1bxi3.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a result, we didn't change any producer or consumer logic, and we didn't change any libraries. With only a change in connection configuration, we get a seamless migration to a fully managed service for event ingestion, and we can use it with many Apache Kafka clients, libraries, and existing applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can You Show Us The Code?
&lt;/h2&gt;

&lt;p&gt;Why, yes. Let's take a look at how this can be done in action!&lt;/p&gt;

&lt;p&gt;Let's say, a company is already using Apache Kafka on HDInsight for their events, together with Apache Spark to process them.&lt;/p&gt;

&lt;p&gt;They have an HDInsight Kafka cluster that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78rdejcmuc6ay2bm0tx6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78rdejcmuc6ay2bm0tx6.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They also have an Azure Databricks workspace that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facrril6ud9i4blzq1duq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facrril6ud9i4blzq1duq.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Their Spark cluster exists in the same virtual network as the Kafka cluster (using &lt;a href="https://docs.microsoft.com/en-us/azure/azure-databricks/vnet-injection?WT.mc_id=kafkaeventhubs-blog-alehall" rel="noopener noreferrer"&gt;VNet Injection&lt;/a&gt; feature), and has the following Spark-Kafka connector library attached:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5ary641y00bptg9kwyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5ary641y00bptg9kwyo.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's generate the data for this use case. We can think of it as sensor data with a timestamp and a numerical indicator, which we generate using Spark's rate stream source and send to Kafka using the Spark-Kafka connector. Let's run it in an Azure Databricks notebook:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

&lt;span class="c1"&gt;// MESSAGE PRODUCER LOGIC&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rates&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt;
  &lt;span class="n"&gt;spark&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;readStream&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"rate"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"rowsPerSecond"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;load&lt;/span&gt;

&lt;span class="c1"&gt;// MESSAGE PRODUCER CONFIGURATION&lt;/span&gt;
&lt;span class="c1"&gt;// CAN BE READ PURELY FROM CONFIGURATION FILES&lt;/span&gt;
&lt;span class="c1"&gt;// REPLACE TOPIC NAME AND BOOTSTRAP SERVERS WITH CORRECT VALUES&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;TOPIC&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"testtopic"&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;BOOTSTRAP_SERVERS&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"172.16.0.4:9092,172.16.0.6:9092,172.16.0.5:9092"&lt;/span&gt;

&lt;span class="n"&gt;rates&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;select&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;$&lt;/span&gt;&lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"key"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;$&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;selectExpr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CAST(key as STRING)"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"CAST(value as STRING)"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;writeStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topic"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TOPIC&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"checkpointLocation"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/ratecheckpoint"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;start&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffppjifkj52apxvpxniy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffppjifkj52apxvpxniy3.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The real-time event stream of sensor data is consumed by a separate notebook:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

&lt;span class="c1"&gt;// CONSUMER CONFIGURATION&lt;/span&gt;
&lt;span class="c1"&gt;// REPLACE TOPIC NAME AND BOOTSTRAP SERVERS WITH CORRECT VALUES&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;TOPIC&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"testtopic"&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;BOOTSTRAP_SERVERS&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"172.16.0.4:9092,172.16.0.6:9092,172.16.0.5:9092"&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rates&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;readStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"subscribe"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TOPIC&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.request.timeout.ms"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"60000"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.session.timeout.ms"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"60000"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failOnDataLoss"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"false"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"startingOffsets"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"latest"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;load&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// PROCESSING LOGIC &lt;/span&gt;
&lt;span class="n"&gt;rates&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;selectExpr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CAST(key as STRING)"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"CAST(value as STRING)"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;writeStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;outputMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"append"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"console"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"truncate"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;start&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="py"&gt;awaitTermination&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanza7l5p3cfg3aa9f14o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanza7l5p3cfg3aa9f14o.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use &lt;code&gt;bootstrap servers&lt;/code&gt; to connect to the Apache Kafka cluster brokers in both the producer and the consumer, and work with the &lt;code&gt;testtopic&lt;/code&gt; topic. For instructions on creating a topic in HDInsight Kafka and getting the Kafka broker addresses, take a look at this &lt;a href="https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-get-started?WT.mc_id=kafkaeventhubs-blog-alehall" rel="noopener noreferrer"&gt;document&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Swapping the Apache Kafka backend with Event Hubs while leaving the code and libraries as is
&lt;/h3&gt;

&lt;p&gt;Now we want to start using Event Hubs, so we create a new Event Hubs namespace with the Apache Kafka &lt;a href="https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview?WT.mc_id=kafkaeventhubs-blog-alehall" rel="noopener noreferrer"&gt;feature&lt;/a&gt; enabled, and add a new &lt;code&gt;testtopic&lt;/code&gt; event hub.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o4z8zy7gwcjozwgnyor.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o4z8zy7gwcjozwgnyor.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To make the same code work with the new event backend, we only need to change the connection configuration in both the producer and the consumer. Instead of using broker IP addresses in &lt;code&gt;bootstrap servers&lt;/code&gt;, we use the Event Hubs endpoint, and we also specify the Event Hubs connection string. Because Event Hubs is a managed service, there is no cluster for us to manage, and the &lt;code&gt;namespace&lt;/code&gt; (the equivalent of a &lt;code&gt;cluster&lt;/code&gt; in Apache Kafka terms) is just a container for topics. Scalability is managed using throughput units (1 TU = 1 MB/sec or 1,000 events/sec of ingress) and can be adjusted automatically when the workload spikes.&lt;/p&gt;
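&lt;p&gt;As a rough back-of-the-envelope illustration (the event rate and size below are made-up numbers, not from any real workload), you can estimate how many throughput units a stream needs by checking both limits and taking the larger requirement:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;
// Rough TU estimate: 1 TU covers 1 MB/sec OR 1000 events/sec of ingress,
// so we need enough TUs to satisfy both limits at once.
val eventsPerSecond   = 5000.0  // assumed rate, illustration only
val avgEventSizeBytes = 512.0   // assumed size, illustration only

val tuByVolume  = math.ceil(eventsPerSecond * avgEventSizeBytes / (1024 * 1024))
val tuByCount   = math.ceil(eventsPerSecond / 1000.0)
val requiredTUs = math.max(tuByVolume, tuByCount).toInt
// ~2.44 MB/sec needs 3 TUs by volume, 5000 events/sec needs 5 TUs by count,
// so this hypothetical stream would need 5 TUs.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;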

&lt;p&gt;Producer code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

&lt;span class="c1"&gt;// UNCHANGED MESSAGE PRODUCER LOGIC&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rates&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; 
    &lt;span class="n"&gt;spark&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;readStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"rate"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"rowsPerSecond"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;load&lt;/span&gt;


&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;TOPIC&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"testtopic"&lt;/span&gt;

&lt;span class="c1"&gt;// NEW VALUE, REPLACE EVENTHUBSNAME WITH YOUR OWN VALUE&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;BOOTSTRAP_SERVERS&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"EVENTHUBSNAME.servicebus.windows.net:9093"&lt;/span&gt;

&lt;span class="c1"&gt;// NEW VALUE, REPLACE EVENTHUBSNAME, SECRETKEYNAME, SECRETKEYVALUE WITH YOUR OWN VALUES&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;EH_SASL&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://EVENTHUBSNAME.servicebus.windows.net/;SharedAccessKeyName=SECRETKEYNAME;SharedAccessKey=SECRETKEYVALUE\";"&lt;/span&gt;

&lt;span class="n"&gt;rates&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;select&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;$&lt;/span&gt;&lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;alias&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"key"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;$&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;selectExpr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CAST(key as STRING)"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"CAST(value as STRING)"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;writeStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topic"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TOPIC&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.sasl.mechanism"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"PLAIN"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.security.protocol"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"SASL_SSL"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.sasl.jaas.config"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;EH_SASL&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"checkpointLocation"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/ratecheckpoint"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;start&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ot74dq3n47xp2oodxkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ot74dq3n47xp2oodxkc.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consumer code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

&lt;span class="c1"&gt;//CONSUMER CONFIGURATION &lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;TOPIC&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"testtopic"&lt;/span&gt;

&lt;span class="c1"&gt;// NEW VALUE, REPLACE EVENTHUBSNAME WITH YOUR OWN VALUE&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;BOOTSTRAP_SERVERS&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"EVENTHUBSNAME.servicebus.windows.net:9093"&lt;/span&gt;

&lt;span class="c1"&gt;// NEW VALUE, REPLACE EVENTHUBSNAME, SECRETKEYNAME, SECRETKEYVALUE WITH YOUR OWN VALUES&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;EH_SASL&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://EVENTHUBSNAME.servicebus.windows.net/;SharedAccessKeyName=SECRETKEYNAME;SharedAccessKey=SECRETKEYVALUE\";"&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.kafka.common.security.plain.PlainLoginModule&lt;/span&gt;

&lt;span class="c1"&gt;// READ STREAM USING SPARK's KAFKA CONNECTOR&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rates&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;readStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"subscribe"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TOPIC&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.sasl.mechanism"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"PLAIN"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.security.protocol"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"SASL_SSL"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.sasl.jaas.config"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;EH_SASL&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.request.timeout.ms"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"60000"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.session.timeout.ms"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"60000"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failOnDataLoss"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"false"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"startingOffsets"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"latest"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.group.id"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"$Default"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;load&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// UNCHANGED PROCESSING LOGIC &lt;/span&gt;
&lt;span class="n"&gt;rates&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;selectExpr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CAST(key as STRING)"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"CAST(value as STRING)"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;writeStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;outputMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"append"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"console"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"truncate"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;start&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="py"&gt;awaitTermination&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp3w6ftkdd27xsrncp1d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp3w6ftkdd27xsrncp1d.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When Should I Still Use Apache Kafka Cluster?
&lt;/h2&gt;

&lt;p&gt;For example, when you &lt;em&gt;want&lt;/em&gt; or &lt;em&gt;need&lt;/em&gt; to manage your own cluster, or when you want to run Apache Kafka on-premises. Or when you rely on certain features not yet supported by the Event Hubs Kafka feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Should I Use Event Hubs With Apache Kafka Clients?
&lt;/h2&gt;

&lt;p&gt;When you are happy with what Event Hubs provides and want to reduce the time you spend managing clusters. Event Hubs offers better integration with existing Azure services. You can also mix and match Apache Kafka and Event Hubs clients! Event Hubs supports many automation features, like auto-inflate, which scales throughput units up automatically to absorb workload spikes.&lt;/p&gt;

&lt;p&gt;Take a look at other &lt;a href="https://github.com/Azure/azure-event-hubs-for-kafka/blob/master/tutorials/?WT.mc_id=kafkaeventhubs-github-alehall" rel="noopener noreferrer"&gt;interesting examples&lt;/a&gt; of technologies that can be used with Event Hubs for Apache Kafka.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;We can keep using Apache Kafka libraries and connectors while using Event Hubs as the backend event ingestion system, which opens the door to an incredible number of integrations.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>computerscience</category>
      <category>kafka</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Applied Cloud Stories</title>
      <dc:creator>Lena Hall</dc:creator>
      <pubDate>Tue, 21 Jan 2020 17:22:24 +0000</pubDate>
      <link>https://dev.to/azure/applied-cloud-stories-3f04</link>
      <guid>https://dev.to/azure/applied-cloud-stories-3f04</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Have experience building practical applications or driving complex workloads on &lt;a href="https://docs.microsoft.com/azure/azure-functions/?WT.mc_id=appliedcloudstories_devto-cxa"&gt;Azure&lt;/a&gt;? &lt;em&gt;Share your #AppliedCloudStories with us!&lt;/em&gt;&lt;br&gt;
✓ Sign up for a &lt;a href="https://azure.microsoft.com/free/?WT.mc_id=appliedcloudstories_devto-cxa"&gt;free account on Azure&lt;/a&gt; and validate stories &lt;br&gt;
✓ Submit your story and &lt;a class="twitter-share-button" href="https://twitter.com/intent/tweet?text=I'm%20joining%20the%20@azureadvocates%20%23AppliedCloudStories%20challenge!!%20Learn%20more%20at%20https://aka.ms/applied-cloud-stories%20and%20join%20me!"&gt; spread the word on Twitter!  &lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;We are delighted to announce the &lt;a href="https://aka.ms/applied-cloud-stories"&gt;Applied Cloud Stories&lt;/a&gt; initiative by Microsoft!&lt;/p&gt;

&lt;h3&gt;
  
  
  📚 What Is Applied Cloud Stories?
&lt;/h3&gt;

&lt;p&gt;Do you work with open-source? Are you passionate about machine learning or data science? Do you have stories to share about solving scale or data challenges? Are you investing time and effort so that you and your teams can ship code better, faster, with more confidence? Do you work with Java, Python, JavaScript, Go, Rust, or other exciting languages in the cloud? Are you active in computer science research and love when theory meets practice?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We would love to hear your &lt;a href="https://aka.ms/applied-cloud-stories"&gt;story&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Applied Cloud Stories is an open call for technical content about relevant cloud scenarios and workloads, interesting challenges, and practical solutions that can run on Azure. You can participate by writing a new article, or recording a new video about the topic of your choice in one of the categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open Source&lt;/li&gt;
&lt;li&gt;Applications&lt;/li&gt;
&lt;li&gt;DevOps + Infrastructure&lt;/li&gt;
&lt;li&gt;Big Data + Distributed Systems&lt;/li&gt;
&lt;li&gt;Machine Learning + Data Science&lt;/li&gt;
&lt;li&gt;Research&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Share your stories before &lt;strong&gt;March 15, 2020&lt;/strong&gt;! Learn more about the timelines, rules of participation, and details about each category in the &lt;a href="https://aka.ms/applied-cloud-stories"&gt;announcement&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  📚 Why Should I Participate?
&lt;/h3&gt;

&lt;p&gt;Your stories will help inspire and educate engineers across the globe about how to approach advanced, innovative, and mission-critical cloud scenarios that are solving specific problems.&lt;/p&gt;

&lt;p&gt;All of the stories will be reviewed by the amazing committee consisting of Azure engineering leaders, cloud advocates, and industry leaders, like &lt;a href="https://twitter.com/editingemily"&gt;Emily Freeman&lt;/a&gt;, &lt;a href="https://twitter.com/jpetazzo"&gt;Jérôme Petazzoni&lt;/a&gt;, &lt;a href="https://twitter.com/ashleymcnamara"&gt;Ashley McNamara&lt;/a&gt;, &lt;a href="https://twitter.com/sarahnovotny"&gt;Sarah Novotny&lt;/a&gt;, &lt;a href="https://twitter.com/aronchick"&gt;David Aronchick&lt;/a&gt;, &lt;a href="//twitter.com/nnja"&gt;Nina Zakharenko&lt;/a&gt; and &lt;a href="https://aka.ms/applied-cloud-stories#reviewers"&gt;more&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Applied Cloud Stories &lt;a href="https://aka.ms/applied-cloud-stories#reviewers"&gt;Reviewers&lt;/a&gt; will vote to determine winning and featured stories.  We will help share featured stories with the world! Winning stories will receive &lt;a href="https://aka.ms/applied-cloud-stories"&gt;prizes and rewards&lt;/a&gt;! &lt;/p&gt;

&lt;h3&gt;
  
  
  📚 Wait, I Have To Write About Azure?
&lt;/h3&gt;

&lt;p&gt;Not really! The minimal requirement is to use some part of the Azure platform to run your scenario's workloads. However, we are not looking for tutorials that explain how to apply Azure services to solve a task; we are more interested in hearing about people’s scenarios and technology choices first. Check out the &lt;a href="https://aka.ms/applied-cloud-stories"&gt;announcement&lt;/a&gt; for examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  Need Ideas?
&lt;/h3&gt;

&lt;p&gt;Don't hesitate to reach out to &lt;a href="//twitter.com/lenadroid"&gt;me&lt;/a&gt; or to any of the committee members to validate your ideas; we'd love to help!&lt;/p&gt;

&lt;p&gt;To make it easier, our committee members put together examples of good story titles to inspire more ideas!&lt;/p&gt;

</description>
      <category>news</category>
      <category>cloud</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Story of Eventually Perfect Distributed Systems</title>
      <dc:creator>Lena Hall</dc:creator>
      <pubDate>Thu, 22 Aug 2019 00:12:22 +0000</pubDate>
      <link>https://dev.to/lenadroid/story-of-eventually-perfect-distributed-systems-209a</link>
      <guid>https://dev.to/lenadroid/story-of-eventually-perfect-distributed-systems-209a</guid>
      <description>&lt;p&gt;&lt;em&gt;This article is based on O'Reilly Velocity 2019 Keynote by &lt;a href="//twitter.com/lenadroid"&gt;@lenadroid&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It's about the impact of our work, the complexity and obstacles we face, and what is important for building better distributed systems, especially when other life-critical areas rely on and build on what we create.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Systems available today already offer many solutions, encapsulate many of the distributed algorithms, and automate and abstract away some of the complexity. Engineers who use them don’t necessarily need the same deep knowledge that was required to develop them in the first place. Even though there is less and less necessity for new engineers to learn the fundamentals those systems are built on, there are scenarios where knowing what is behind the scenes is essential to making the right decisions and to solving the difficult issues that come up when something doesn’t act as expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are Fundamentals Still Important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why is it important to come back to fundamentals from where we are now? We are at a point where distributed systems are gaining adoption in the medical field, autonomous devices, transport automation, and other life-critical scenarios, where the cost of a mistake is growing and correctness becomes really important.&lt;/p&gt;

&lt;p&gt;The cost of a mistake is not how many seconds your system was unavailable today; it is the price your system's inaction or failure carried for your users and their users. We should treat this with responsibility and always remember why we’re doing this and what real problem we’re solving.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmgn3nsx2e9zdoc3a18ar.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmgn3nsx2e9zdoc3a18ar.jpeg" alt="Trade-offs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every industry wants to make more progress by combining their and our research and solutions. Our work helps other fields and domains to better achieve their goals. And it can be immensely useful for us to understand how to relax certain limitations or fine-tune certain trade-offs.&lt;/p&gt;

&lt;p&gt;Understanding what’s at the core is a powerful tool to navigate the complexity of ever-changing options and tools, and it helps us to compose correct solutions to improve the options we have.&lt;/p&gt;

&lt;p&gt;Turns out it’s not so easy, there are some obstacles in our way!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There Is A Big Gap Between Theory And Practice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s challenging to understand what “correct” means for our systems. Most of the theoretical material is not approachable enough; it is notoriously hard to grasp, and it often doesn’t include the information necessary to successfully bring the theory into practice. Production systems have to modify the theoretical algorithms and adjust them to work in real environments, and many papers don’t reveal the specific details important for practical solutions. Even a slight misunderstanding of the details of a protocol destroys its correctness, so we need to do additional work to guarantee that an implementation is still correct.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fo5dbb0x775ibcbm2isry.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fo5dbb0x775ibcbm2isry.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard To Verify And Maintain Correctness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s challenging to verify and maintain the correctness of distributed systems in real-world environments. An algorithm can sound perfect on paper but turn out inefficient or hard to implement in practice, or sound unrealistic in theory but be perfectly acceptable in practice. There are many things that can go wrong, both in the algorithm logic and in its implementation, that are hard to detect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correctness Isn’t Always A Priority&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, correctness doesn't always happen to be a priority. There are deadlines, competition, and customers who need solutions faster. It can happen that the end system is rushed out and not properly correct. This means a team might not have time to properly discuss and systematically address the real reasons behind rare “intermittent” errors, which will then happen again and lead to more errors in the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Can We Improve?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can improve in many directions. One of them is emphasizing and putting more focused efforts on improving correctness, to make sure we are able to build and maintain systems that keep doing exactly what we expect them to do.&lt;/p&gt;

&lt;p&gt;Another is improving our understanding of how these systems work, as it helps us reduce the complexity and possible mistakes, and makes us more prepared to face the challenges that come up.&lt;/p&gt;

&lt;p&gt;Even when we aren’t implementing distributed algorithms and concepts directly, we rely on systems that do. At the point where what we build intersects with other domains and fields, understanding fundamental concepts and trade-offs becomes extremely relevant.&lt;/p&gt;

&lt;p&gt;If you are promised some performance and consistency, how do you actually make sure those guarantees are provided at the exact level you need?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple Problems Become Hard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When several computers are talking to each other, trivial problems become hard problems, and they accumulate. Distributed systems are hard to understand, hard to implement, and it’s hard to keep them correct in practice.&lt;/p&gt;

&lt;p&gt;I think there are many ways to show why. Recently, I had a chance to explain it to someone from the field of bioinformatics, who was wondering why they needed to make trade-offs between important properties in a distributed setting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ordering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first thing that came to my mind was ordering events. Ordering is easy on a single machine, but when messages are sent across the network, it’s hard! We can’t rely on physical timestamps, because physical clocks on different machines tend to drift. For ordering in distributed systems, we often apply logical clocks: simply speaking, counters that are passed around between nodes.&lt;/p&gt;

&lt;p&gt;Because of the asynchronous nature of distributed systems, we can’t easily establish a total order for all the events, because some of them are concurrent! What we can do is figure out which events are concurrent and which events happened before one another. And even with such a simple task, we already need to make some decisions.&lt;/p&gt;
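&lt;p&gt;As a minimal sketch of the idea (my own illustration, not code from the talk), a vector clock keeps one counter per node and lets us test the happened-before relation; two events are concurrent exactly when neither happened before the other:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;
// One counter per node; missing entries are treated as zero.
type VClock = Map[String, Int]

// A node increments its own entry on a local event or before sending a message.
def tick(c: VClock, node: String): VClock =
  c.updated(node, c.getOrElse(node, 0) + 1)

// On receive, a node merges the sender's clock into its own (then ticks).
def merge(a: VClock, b: VClock): VClock =
  (a.keySet ++ b.keySet).map(k =&gt; k -&gt; math.max(a.getOrElse(k, 0), b.getOrElse(k, 0))).toMap

// a happened before b iff a is less than or equal to b component-wise
// and the clocks are not identical.
def happenedBefore(a: VClock, b: VClock): Boolean =
  (a.keySet ++ b.keySet).forall(k =&gt; a.getOrElse(k, 0) &lt;= b.getOrElse(k, 0)) &amp;&amp; a != b

// Concurrent events are ordered by neither direction of happened-before.
def concurrent(a: VClock, b: VClock): Boolean =
  !happenedBefore(a, b) &amp;&amp; !happenedBefore(b, a)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;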

&lt;p&gt;For example, would we be okay if the system told us that events are ordered, but in reality, they turned out to be concurrent?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fy07tx43s7rdwsg2853a0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fy07tx43s7rdwsg2853a0.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or do we need to know for sure that events really are not concurrent when we can order them?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7ljktcv5gb2yyhyzag0d.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7ljktcv5gb2yyhyzag0d.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agreement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can’t simply order concurrent events, yet sometimes we still need to decide on an order of operations, on a value, on a sequence of values, or on anything else.&lt;/p&gt;

&lt;p&gt;It turns out that getting several machines to choose the same thing is another situation where we have to ask ourselves questions and determine what’s right for us.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fsbgnu9mmh0qezmimcg70.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fsbgnu9mmh0qezmimcg70.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, Two-Phase Commit is one of the solutions for getting all of our nodes to agree on something.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fd2lm0ojv3hotqo9148mf.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fd2lm0ojv3hotqo9148mf.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It works, as long as our nodes don’t fail.&lt;/p&gt;

&lt;p&gt;If some of the nodes crash, then to prevent any possibility of inconsistent data, the system simply blocks until the crashed nodes come back, which might take a very long time or never happen at all.&lt;/p&gt;

&lt;p&gt;So the algorithm is safe but isn’t live.&lt;/p&gt;
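
&lt;p&gt;To make the blocking behavior concrete, here is a toy sketch of a participant’s decision rule in Two-Phase Commit (hypothetical names, no real networking): once a participant has voted “yes” and the coordinator disappears, neither committing nor aborting unilaterally is safe.&lt;/p&gt;

```python
def participant_decision(voted, coordinator_msg):
    """What a 2PC participant may safely do after the vote phase.

    voted:           this participant's vote, "yes" or "no"
    coordinator_msg: "commit", "abort", or None if the coordinator
                     crashed before its decision arrived
    """
    if voted == "no":
        # We vetoed, so the global decision can only be abort.
        return "abort"
    if coordinator_msg is not None:
        return coordinator_msg
    # We voted yes and the coordinator is gone. It may already have told
    # other participants to commit, or it may have decided to abort:
    # choosing either unilaterally could disagree with them, so the only
    # safe move is to wait.
    return "blocked"

print(participant_decision("yes", "commit"))  # commit
print(participant_decision("no", None))       # abort
print(participant_decision("yes", None))      # blocked: safe, but not live
```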

&lt;p&gt;What if it’s not something we can accept and we really need a system to respond?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1oprcqp5qcd36wzcfubp.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1oprcqp5qcd36wzcfubp.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, we might have another possible solution: three-phase commit. It doesn’t block when there are failures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9a8rxwxbvk4of3h0ruhm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9a8rxwxbvk4of3h0ruhm.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But when there’s a network partition ...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7259eo15a29izjukf7p6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7259eo15a29izjukf7p6.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two isolated sides of the network can come to two different decisions after they time out, and the system will end up in an inconsistent state.&lt;/p&gt;

&lt;p&gt;So here we have the opposite: the system is responsive and live, but it’s not safe because different nodes can decide on different values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fk0q433gybhxesdldzaop.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fk0q433gybhxesdldzaop.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we are okay with either of these two options, that’s great!&lt;/p&gt;

&lt;p&gt;What should we do if we want data to be always consistent and the system to be responsive?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impossibility Result&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf" rel="noopener noreferrer"&gt;impossibility result&lt;/a&gt; proved that actually, there isn’t a deterministic algorithm that will always terminate and come to a decision in a fully asynchronous environment, with even one possible failure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.019.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.019.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main lesson from this result is that if we want to solve agreement in practice, we have to rethink our assumptions to reflect more realistic expectations. For example, we can put an upper bound on message delays and determine what number of failures is acceptable for our system.&lt;/p&gt;

&lt;p&gt;If we change our assumptions, we can solve distributed agreement in multiple ways!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.020.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.020.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Paxos&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most famous solution is the &lt;a href="https://lamport.azurewebsites.net/pubs/paxos-simple.pdf" rel="noopener noreferrer"&gt;Paxos&lt;/a&gt; algorithm, which is known for being hard to understand and hard to implement correctly.&lt;/p&gt;

&lt;p&gt;It works, but only under the condition that a majority of nodes are up and the maximum message delay is bounded.&lt;/p&gt;

&lt;p&gt;In Paxos, any node can propose a value, and after going through the “Prepare” and “Propose” phases, all of the nodes should agree on the same value.&lt;/p&gt;

&lt;p&gt;A majority of nodes needs to be up because any two majority quorums intersect: in every phase there is at least one node that remembers the most recent proposal, which prevents the system from agreeing on an old value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.022.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.022.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many optimizations to the initial algorithm that are applied in practice to make it more efficient. There are also many possible variations of consensus algorithms based on the trade-offs they choose.&lt;/p&gt;

&lt;p&gt;For example, how much work is done by the leader. Having a strong leader can be good or bad, depending on how frequently it fails and how hard it is to reelect. Another trade-off is how many node failures we can forgive and how big the quorum should be.&lt;/p&gt;

&lt;p&gt;A somewhat underrated criterion is how understandable and implementable the algorithm is in practice. Raft became popular because it is easier to understand, and it is now used in many widely deployed projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Still Discovering New Trade-Offs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But what’s even more interesting is that even though the topic of consensus and agreement isn’t new, we are still discovering many new optimizations and trade-offs.&lt;/p&gt;

&lt;p&gt;In classic Paxos, a majority needs to be up to make sure all the quorums intersect. But it turns out we can rethink and relax the majority-quorum requirement: it’s enough for the quorums of the prepare and propose phases to intersect with each other, which gives us much more flexibility to experiment with quorum sizes and performance in each phase!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.023.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.023.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main point is that revealing new performance and availability trade-offs, even today, expands the spectrum of choices we have in practice. Consensus is just a building block, but it can be used to solve many common problems: atomic broadcast, distributed locks, strongly consistent replication, and many more!&lt;/p&gt;

&lt;p&gt;Please check out Dr. Heidi Howard's &lt;a href="https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf" rel="noopener noreferrer"&gt;paper "Distributed Consensus Revised"&lt;/a&gt;. It's one of the very best papers on the topic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent Replication?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Replication is a massive part of any distributed system today.&lt;/p&gt;

&lt;p&gt;We can use consensus to implement strongly consistent replication, but one of its downsides is performance. The other side of the spectrum is eventually consistent replication, which is very fast, but clients can see inconsistent data. In practice, we often want better performance while still maintaining stronger consistency, which can be tricky.&lt;/p&gt;

&lt;p&gt;So in some cases, we can come up with solutions that are faster than consensus and more consistent than eventual consistency.&lt;/p&gt;

&lt;p&gt;One of the interesting examples is Aurora, where they &lt;a href="https://dl.acm.org/citation.cfm?id=3183713.3196937" rel="noopener noreferrer"&gt;avoid consensus&lt;/a&gt; for I/O and a couple of other operations. They use quorums for writes but don’t use them for reads. In reality, replicas might be storing different values, but when a client performs a read, the database can use the consistency points it maintains to look directly at nodes where data is known to be consistent and return the correct data to the client.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.025.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.025.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conflict-Free Replicated Data Types (CRDTs)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another interesting example is &lt;a href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type" rel="noopener noreferrer"&gt;Conflict-Free Replicated Data Types&lt;/a&gt;. They provide strong eventual consistency with fast reads and writes, and they stay available even during network partitions, without using consensus or synchronization. However, this is only possible if we have rules for resolving any concurrent conflict.&lt;/p&gt;

&lt;p&gt;In other words, we can only use this technique if it’s possible to merge concurrent updates using some function that can apply them in any order, and as many times as we want, without corrupting the result.&lt;/p&gt;
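
&lt;p&gt;The classic merge function of this kind is the grow-only counter (G-Counter) CRDT, sketched below: each replica increments only its own slot, and merge is an element-wise maximum, which is commutative, associative, and idempotent, so updates can be applied in any order, any number of times.&lt;/p&gt;

```python
def increment(counter, replica_id, amount=1):
    # Each replica only ever bumps its own slot.
    new = dict(counter)
    new[replica_id] = new.get(replica_id, 0) + amount
    return new

def merge(a, b):
    # Element-wise max: commutative, associative, and idempotent.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

# Two replicas update concurrently...
r1 = increment({}, "r1", 3)
r2 = increment({}, "r2", 4)

# ...and converge regardless of merge order or repetition.
assert merge(r1, r2) == merge(r2, r1)             # order doesn't matter
assert merge(merge(r1, r2), r2) == merge(r1, r2)  # re-applying is harmless
assert value(merge(r1, r2)) == 7
```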

&lt;p&gt;This is a perfectly acceptable example: all the updates are additive, so they satisfy this requirement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.027.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.027.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This one, on the other hand, isn’t as obvious, because we don’t have clear rules for resolving conflicts between simultaneous updates of this kind.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.028.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.028.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://azure.microsoft.com/en-us/services/cosmos-db?WT.mc_id=keynote-blog-alehall" rel="noopener noreferrer"&gt;Azure Cosmos DB&lt;/a&gt; uses CRDTs for conflict resolution behind the scenes of concurrent multi-master multi-region writes. Redis and Riak also use CRDTs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.029.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.029.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If we teleport into another topic in distributed systems, we will always find more trade-offs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/natadzen/failure-detectors-and-consensus-fsharp" rel="noopener noreferrer"&gt;Failure detectors&lt;/a&gt; is one of the essential techniques to discover node crashes in a distributed system. They can be applied in agreement problems, leader election, group membership protocols and in other areas.&lt;/p&gt;

&lt;p&gt;We can measure the efficiency of failure detectors by their “completeness” and “accuracy”.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Completeness&lt;/em&gt; shows whether some or all nodes in the system discover all the failures. &lt;em&gt;Accuracy&lt;/em&gt; measures how badly a failure detector can be mistaken when it suspects another node of having failed.&lt;/p&gt;

&lt;p&gt;It turns out that even unreliable failure detectors can be extremely useful in practical systems, because we can improve their completeness by adding a gossip mechanism that spreads knowledge about failures to all the nodes.&lt;/p&gt;
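
&lt;p&gt;A common practical shape is a heartbeat-based detector, sketched here with hypothetical names and an explicit clock: a node is suspected once its last heartbeat is older than a timeout. The timeout is exactly where the accuracy trade-off lives: too short and we wrongly suspect slow nodes; too long and we notice real crashes late.&lt;/p&gt;

```python
class HeartbeatDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        # Record that we heard from `node` at time `now`.
        self.last_seen[node] = now

    def suspected(self, now):
        # Suspect any node whose heartbeat is older than the timeout.
        # This can be inaccurate: a merely slow node looks crashed.
        return {
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        }


fd = HeartbeatDetector(timeout=3)
fd.heartbeat("a", now=0)
fd.heartbeat("b", now=0)
fd.heartbeat("a", now=5)                # "a" keeps beating, "b" goes quiet
assert fd.suspected(now=6) == {"b"}     # "b" is suspected, rightly or not

# If "b" was only slow and beats again, the suspicion is retracted:
fd.heartbeat("b", now=7)
assert fd.suspected(now=8) == set()
```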

&lt;p&gt;&lt;strong&gt;Why Does All Of This Matter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Trade-offs may take different shapes and forms, and we can be really flexible if we know how to use them and where to look.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.032.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.032.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many products are built around these algorithms and trade-offs. The products make certain choices for us, and we make choices by using certain products. Uneducated choices can result in delays and data loss. For some systems, this can mean losing clients and large amounts of money. For others, it can mean a slow reaction or a wrong order of actions that poses an actual threat to life. Understanding your trade-offs is crucial for making the right choices, for knowing what “correct” means, and for verifying the correctness of your systems in reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verifying And Maintaining Correctness In Reality&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After we are clear on our decisions and trade-offs, how do we maintain correctness in real systems?&lt;/p&gt;

&lt;p&gt;One frequently used option for verifying distributed logic for safety and liveness, especially safety, is model checking. Model checking is useful because it explores all possible states your system can end up in. There are quite a few tools out there: &lt;a href="https://lamport.azurewebsites.net/tla/tla.html" rel="noopener noreferrer"&gt;TLA+&lt;/a&gt; is pretty famous, and there are emerging techniques like &lt;a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-leesatapornwongsa.pdf" rel="noopener noreferrer"&gt;semantic-aware model checking&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To verify correctness of real, running implementations of distributed systems, model checking alone isn’t enough.&lt;/p&gt;

&lt;p&gt;Not many projects publish information on how they maintain correctness of their systems and verify it. But some of them do.&lt;/p&gt;

&lt;p&gt;For example, a &lt;a href="https://www.confluent.io/blog/apache-kafka-tested/" rel="noopener noreferrer"&gt;variety of system tests&lt;/a&gt; for Kafka is run every day and anyone in the world can &lt;a href="http://testing.confluent.io/confluent-kafka-system-test-results/" rel="noopener noreferrer"&gt;check and see&lt;/a&gt; what is working as expected and what isn’t.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.036.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.036.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cassandra has a really great write up about &lt;a href="http://cassandra.apache.org/blog/2018/08/21/testing_apache_cassandra.html" rel="noopener noreferrer"&gt;their approach&lt;/a&gt; to &lt;a href="http://cassandra.apache.org/blog/2018/10/17/finding_bugs_with_property_based_testing.html" rel="noopener noreferrer"&gt;comprehensive testing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.037.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.037.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I really wish more products, projects, and systems were open about the effort they put into testing and correctness verification.&lt;/p&gt;

&lt;p&gt;If we look at what it takes to be prepared to run a production-level distributed system, there’s quite a lot.&lt;/p&gt;

&lt;p&gt;Of course, for small-scope scenarios and for ensuring that multiple services work well together, unit tests and integration tests are essential. But they’re not enough! There are more techniques we can use. Fuzz testing and property-based testing feed randomly generated input to your system to make sure its fundamental properties hold according to its specification. I actually worked on a fuzzing project at Microsoft Research, and it’s a fascinating topic in general. Performance tests are useful for collecting data on the latency and throughput of various components. Fault injection helps check that the system stays available during fault scenarios and that the expected system properties remain correct.&lt;/p&gt;
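
&lt;p&gt;As a sketch of the property-based idea (the function under test and its properties are illustrative), instead of one hand-picked example we assert invariants over many randomly generated inputs:&lt;/p&gt;

```python
import random

def merge_logs(a, b):
    """Hypothetical function under test: merge two replicas' event logs,
    dropping duplicates and keeping (timestamp, event) order."""
    return sorted(set(a) | set(b))

def random_log(rng, size=8):
    # Generate an arbitrary log of (timestamp, event) pairs.
    return [(rng.randint(0, 5), rng.choice("abc")) for _ in range(size)]

rng = random.Random(42)  # seeded so failures are reproducible
for _ in range(1000):
    a, b = random_log(rng), random_log(rng)
    # Properties that must hold for *any* input, not just one example:
    assert merge_logs(a, b) == merge_logs(b, a)   # merge order irrelevant
    assert merge_logs(a, a) == sorted(set(a))     # merging twice is harmless
    assert set(a).issubset(merge_logs(a, b))      # no events lost
print("1000 random cases passed")
```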

&lt;p&gt;Even with all of that, a striking share of the most critical errors originates in exception-handling logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.038.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.038.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are some things we can’t fully fix. We need to accept that, even after all the tests and checks we’ve written, there will be mistakes anyway. We’re human, there are context switches, it’s impossible to know every single thing, and there are too many moving parts. We’d never explore new territory or make progress if we were afraid to leave the areas we’re familiar with. However, we can definitely prepare ourselves better for dealing with unexpected errors, find patterns, and try to address what causes them. That’s why instrumenting your code matters, and observability matters. It’s less scary when we’re aware of the possibility and have built a foundation for solving production errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Products change fast, and the terms that describe their consistency, resiliency, and performance are extremely overloaded. Fundamental concepts and trade-offs stay and build up. They aren’t useful in isolation, but knowing them can be essential for making the right choices and maintaining correctness in practice. Correctness is especially important when our systems are trusted in scenarios where a specific level of responsiveness and safety is a strong requirement.&lt;/p&gt;

&lt;p&gt;If you are building something, ask yourself: can this be misunderstood? Complexity is like a big bulletproof wall around your project: it makes the project hard to explain, build, use, and improve. Try to make the systems you build understandable to others, because understanding contributes to correctness.&lt;/p&gt;

&lt;p&gt;Correctness isn’t easy and doesn’t come for free. You have to work on it and make it a priority, not just at the level of one willing engineer, but at the level of the entire organization. Don’t trust your system to just work: test it, verify it, and be ready when things fail. Show your users and customers what techniques and effort you are putting into verifying your systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.040.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.040.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think of areas related to your work that are important but aren’t getting enough attention. If you ever have an opportunity to chat with people different from you, who work in another area, do it: learn what challenges they face in their work and what trade-offs they’re making. Ask questions. Share the same about your work. It will help you be a better engineer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.041.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flenadroid.github.io%2Fposts%2FOReillyVelocity%2Fimages%2FVelocity-Keynote-Lena-Hall.041.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>testing</category>
      <category>programming</category>
      <category>database</category>
    </item>
    <item>
      <title>Making Machine Learning Approachable</title>
      <dc:creator>Lena Hall</dc:creator>
      <pubDate>Wed, 03 Apr 2019 21:55:34 +0000</pubDate>
      <link>https://dev.to/lenadroid/making-machine-learning-approachable-7ap</link>
      <guid>https://dev.to/lenadroid/making-machine-learning-approachable-7ap</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a href="http://ml4all.org/attend.html" rel="noopener noreferrer"&gt;Attend ML4ALL&lt;/a&gt; in Portland, OR April 28-30.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The topic of Machine Learning advances and plays a bigger role in our development lives every day. It sounds exciting, and it's a universe of areas on its own. People can spend their entire lives studying some topic within machine learning, and there will be more and more to learn.&lt;/p&gt;

&lt;p&gt;There isn't a person who knows everything there is to know about machine learning, but there are people who know a lot about certain areas of it. For example, someone can be an expert in machine learning for recommendation systems or natural language processing but know nothing about deep learning for computer vision. Someone can be great at computer vision but know nothing about natural language processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Machine Learning Really THAT Hard? 🤔
&lt;/h2&gt;

&lt;p&gt;TL;DR: It can be, but doesn't &lt;em&gt;have to be&lt;/em&gt; THAT hard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fh1q5bmom1cxyvlwclpc6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fh1q5bmom1cxyvlwclpc6.jpg" alt="complexity"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Often we hear about machine learning and deep learning as topics that only researchers, mathematicians, or PhDs are smart enough to grasp. When machine learning appears to be the most complex area in computer science, it is most likely because of several common reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Language used in many learning resources is cryptic.&lt;/li&gt;
&lt;li&gt;There aren't enough real-world examples of machine learning algorithms applied to business problems we can relate to.&lt;/li&gt;
&lt;li&gt;Explanation of fundamental concepts assumes knowledge of a certain amount of mathematics and notation and uses insufficiently explained formulas.&lt;/li&gt;
&lt;li&gt;Many learning resources are often hit or miss, as some of them are too difficult to understand, and others are hiding too many important underlying details.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Making Machine Learning Understandable ✨
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpf6fltvcaghb1r1vbmlk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpf6fltvcaghb1r1vbmlk.jpeg" alt="understanding"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It &lt;em&gt;is&lt;/em&gt; possible to explain seemingly complex fundamental concepts and algorithms of machine learning without using cryptic terminology or confusing notation. &lt;/p&gt;

&lt;p&gt;With this thought in mind, a group of like-minded friends and I decided to organize a community conference about machine learning "for the rest of us", and call it 🎉&lt;a href="http://ml4all.org" rel="noopener noreferrer"&gt;ML4ALL&lt;/a&gt;🎉. We set some goals for the conference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Invite speakers who can articulate difficult concepts in approachable, non-boring, intuitive ways.&lt;/li&gt;
&lt;li&gt;Make sure the conference is extremely affordable and accessible to everyone who wants to learn machine learning.&lt;/li&gt;
&lt;li&gt;Build a community of friendly and curious people, passionate about machine learning and data science.&lt;/li&gt;
&lt;li&gt;Provide a platform for people to collaborate and exchange ideas through an unconference and encouraged discussions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;We held the first ML4ALL conference in May 2018 and it was successful!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All of the talk recordings are &lt;a href="https://www.youtube.com/channel/UCfVG8K_0XyMh7r2162aDKJw/videos" rel="noopener noreferrer"&gt;available online&lt;/a&gt;. For example, Paige Bailey gave a great talk called &lt;a href="https://www.youtube.com/watch?v=n5ae9SejRh4" rel="noopener noreferrer"&gt;"Kill (Deep) Math"&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5cqa6pn6k3z4c1busjk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5cqa6pn6k3z4c1busjk2.png" alt="Paige's talk"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ML4ALL Is Back!
&lt;/h2&gt;

&lt;p&gt;It turns out the audience &lt;a href="https://www.youtube.com/watch?v=m3EK-C1WylM" rel="noopener noreferrer"&gt;loved&lt;/a&gt; ML4ALL more than we could have expected, and we immediately knew we were on the right path. This is why ML4ALL is back in Portland this year on &lt;strong&gt;April 28 - 30&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Come and be weird with us at ML4ALL this year!&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;If you are coming from Seattle, you can join us on a Machine Learning &lt;em&gt;Train&lt;/em&gt; to Portland.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🎟🎟🎟&lt;strong&gt;Tickets&lt;/strong&gt;: &lt;a href="http://ml4all.org/attend.html" rel="noopener noreferrer"&gt;Buy here&lt;/a&gt; (extremely affordable - $150-$375).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fb2ed5hzvwn9z5jfp62wf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fb2ed5hzvwn9z5jfp62wf.jpg" alt="ML4ALL"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have already announced our first speakers with a wide range of topics, including:&lt;/p&gt;

&lt;p&gt;✓ Connected Feature Extraction&lt;br&gt;
✓ Classifier to Listen to Killer Whales&lt;br&gt;
✓ Machine Learning Ops&lt;br&gt;
✓ Quantum Machine Learning&lt;br&gt;
✓ Feature Engineering&lt;br&gt;
✓ Privacy in Machine Learning&lt;br&gt;
✓ Churn Prediction&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Frh5god4u6r0jgb2itsu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Frh5god4u6r0jgb2itsu3.png" alt="speakers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many more topics and speakers will be announced soon.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The conference is organized 100% by the community: Lena Hall, Troy Howard, Adron Hall, Byron Gerlach, Glenn Block, and Ben Acker (Ben created the ML4ALL art-character ❤️).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Follow us on Twitter &lt;a href="https://twitter.com/ml4all" rel="noopener noreferrer"&gt;@ml4all&lt;/a&gt; and spread the word!&lt;/p&gt;

&lt;p&gt;❓&lt;em&gt;What are the topics you'd like to see explained better&lt;/em&gt; ❓&lt;/p&gt;

</description>
      <category>learning</category>
      <category>programming</category>
      <category>explainlikeimfive</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
