<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tim</title>
    <description>The latest articles on DEV Community by Tim (@sqlinsix).</description>
    <link>https://dev.to/sqlinsix</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F617142%2F549b6543-b329-4ed5-9fd1-c7a809a7919f.jpg</url>
      <title>DEV Community: Tim</title>
      <link>https://dev.to/sqlinsix</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sqlinsix"/>
    <language>en</language>
    <item>
      <title>Quick Tip: Secret Scopes For Azure Key Vaults</title>
      <dc:creator>Tim</dc:creator>
      <pubDate>Mon, 09 Sep 2024 10:14:28 +0000</pubDate>
      <link>https://dev.to/sqlinsix/quick-tip-secret-scopes-for-azure-key-vaults-4igh</link>
      <guid>https://dev.to/sqlinsix/quick-tip-secret-scopes-for-azure-key-vaults-4igh</guid>
      <description>&lt;p&gt;How do you create a secret scope for an Azure Key Vault in Databricks?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your Databricks instance URL and append &lt;code&gt;#secrets/createScope&lt;/code&gt; to it&lt;/li&gt;
&lt;li&gt;Add the Scope Name&lt;/li&gt;
&lt;li&gt;Enter the Key Vault's DNS Name. You can find this on the Key Vault's Overview page.&lt;/li&gt;
&lt;li&gt;Add the Key Vault's Resource ID. You can find this on the Key Vault's Properties page.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Pick a name that's intuitive for development and makes automation easier for DevOps (avoid arbitrary naming).&lt;/p&gt;

</description>
      <category>data</category>
      <category>azure</category>
    </item>
    <item>
      <title>Building A Data Warehouse Library</title>
      <dc:creator>Tim</dc:creator>
      <pubDate>Mon, 05 Aug 2024 11:03:09 +0000</pubDate>
      <link>https://dev.to/sqlinsix/building-a-data-warehouse-library-1e9b</link>
      <guid>https://dev.to/sqlinsix/building-a-data-warehouse-library-1e9b</guid>
      <description>&lt;p&gt;How do you build a data warehouse with OOP? I suggest designing by delineating the classes that you'll need to both create and maintain the data warehouse. The first time that you build these classes, you will probably build them wrong. This is expected as testing helps improve the design over time. A general overview of the classes that you'll create:&lt;/p&gt;

&lt;p&gt;Your configuration or connection classes should be able to access the appropriate properties of the source, staging and final destinations. Import related libraries at the highest scope here only if they cannot be accessed or referenced at more detailed levels of your library. For example, you can import pandas wherever you need it (generally transformations), so you wouldn't need to import it here at this step unless you'll use it.&lt;/p&gt;

&lt;p&gt;Consider how you separate the dimension and fact classes. One technique that I tend to use involves fact and dimension classes inheriting from a base object. This base object has the standard columns that we tend to use frequently in ETL, such as timestamps of when records were added, updated and whether they're active. Because dimensions can be different types whereas facts cannot, facts and dimensions have different properties and methods. For instance, a fact class will have a foreign key reference method that ensures that a key column ties out with a dimension. A dimension doesn't need this because a dimension can have a key that isn't found in a fact, but the inverse cannot be true.&lt;/p&gt;
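
&lt;p&gt;As a sketch, the standard columns that such a base object might manage could look like the following table definition (table and column names here are illustrative, not from the original design):&lt;/p&gt;

```sql
-- A dimension table whose audit columns come from the shared base object
-- (all names here are hypothetical)
CREATE TABLE dbo.DimCustomer (
    CustomerKey INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName VARCHAR(100) NOT NULL,
    -- standard ETL columns the base object manages:
    AddedDate DATETIME2 NOT NULL,
    UpdatedDate DATETIME2 NULL,
    IsActive BIT NOT NULL
);
```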

&lt;p&gt;Build a transformation class for repeatable functionality. Generally, transformations occur on a single column level, though there are some cases in which they occur on multiple columns or even the entire table (much rarer). One of the most expensive steps in building and maintaining a data warehouse (and ETL in general) is the transformation step. Over time, you may find it convenient to further break this transformation class into smaller classes, such as transformation classes for machine learning, analysis, etc. For instance, one of my repetitive methods involves pivot tables. Over time, I've added this functionality into a one hot encoding method. Keep in mind that you can spend decades never doing certain transformations, and you'll generally find that 20-30% of your transformations account for most of the repeated use.&lt;/p&gt;

&lt;p&gt;Finally, consider scale. Scale can impact all the other steps, but it can also cause people to never finish them. I've seen more waste in planning for scale than in actually solving it. When you know the data volume will grow, you want to consider it - just make sure this doesn't become the procrastination point that results in unnecessary planning meetings where nothing gets done. Horizontally scaling your data may require changes to your other code. Accept that so that you can still create repeatable functionality. In general, scale is not an issue that everyone in data faces, which is why I've seen worries about scale cost companies more in planning than if they had approached the problem without the concern.&lt;/p&gt;

&lt;p&gt;This provides an overall picture of how you'd approach a data warehouse with OOP. This is not the actual solution that you'd use as details matter.&lt;/p&gt;

</description>
      <category>data</category>
      <category>oop</category>
    </item>
    <item>
      <title>Quick Facts On the AT&amp;T Breach</title>
      <dc:creator>Tim</dc:creator>
      <pubDate>Mon, 15 Jul 2024 11:03:14 +0000</pubDate>
      <link>https://dev.to/sqlinsix/quick-facts-on-the-att-breach-5fob</link>
      <guid>https://dev.to/sqlinsix/quick-facts-on-the-att-breach-5fob</guid>
      <description>&lt;p&gt;For those of you who haven't seen yet, AT&amp;amp;T recently informed customers of a breach. &lt;/p&gt;

&lt;p&gt;Some quick facts about the AT&amp;amp;T breach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Impacted over 70 million customers.&lt;/li&gt;
&lt;li&gt;Revealed call and text meta information. I'll add to this point because I've seen some misinformation claiming that content (ie: messages) was compromised. This is not correct.&lt;/li&gt;
&lt;li&gt;The meta information involved not just AT&amp;amp;T numbers, but also numbers associated with those AT&amp;amp;T numbers (ie: contacts who conversed through phone calls/texts).&lt;/li&gt;
&lt;li&gt;The compromise period was during a few months in 2022.&lt;/li&gt;
&lt;li&gt;Unfortunately, there was a delay in releasing that there was a breach. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unfortunately, there are quite a few ways that hackers can use this information. One big cost is that it expands hackers' linkability with your information. If you're wondering how hackers can use linkability, I wrote a post for OSINT called &lt;a href="https://osintteam.blog/what-is-linkability-bd3f0587e4ba" rel="noopener noreferrer"&gt;Why everyone is panicking over linkability&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>data</category>
      <category>security</category>
    </item>
    <item>
      <title>Importance of Data Discovery Spikes</title>
      <dc:creator>Tim</dc:creator>
      <pubDate>Mon, 10 Jun 2024 10:53:23 +0000</pubDate>
      <link>https://dev.to/sqlinsix/importance-of-data-discovery-spikes-53do</link>
      <guid>https://dev.to/sqlinsix/importance-of-data-discovery-spikes-53do</guid>
      <description>&lt;p&gt;A meeting prior to a data discovery spike tends to use time poorly for everyone involved. In corporate situations (ie: not contract/client based) where you can spike on a data set prior to meeting about the needs, this will save everyone time.&lt;/p&gt;

&lt;p&gt;As you first review the data (spike), you can catalog as needed, provided that this isn't already available. If it's available, compare how it's cataloged versus what's actually there (in some cases, data is labeled incorrectly). With this technique, you'll have an understanding of what you're looking at and where the data live. When you meet about the data needs, you'll already have mental links to the data.&lt;/p&gt;

&lt;p&gt;This is the point where I'll use pre-built queries/analysis to review the data set.&lt;/p&gt;
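
&lt;p&gt;As a sketch, one of these pre-built review queries might profile the basic shape of a table - row count, distinct values and nulls in a column of interest (the table and column names below are hypothetical):&lt;/p&gt;

```sql
-- Quick profile of a candidate table (hypothetical names)
SELECT COUNT(*) AS TotalRows,
       COUNT(DISTINCT CustomerId) AS DistinctCustomers,
       SUM(CASE WHEN CustomerId IS NULL THEN 1 ELSE 0 END) AS NullCustomerIds
FROM dbo.Sales;
```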

&lt;p&gt;The result is that you'll take much better notes thanks to those mental links, and you'll ask better questions when you have them. The best part is that you'll save everyone time - there is no shortage of people in the corporate world complaining about the number of meetings they attend!&lt;/p&gt;

</description>
      <category>data</category>
      <category>database</category>
    </item>
    <item>
      <title>Row Level Security In SQL Server</title>
      <dc:creator>Tim</dc:creator>
      <pubDate>Fri, 31 May 2024 11:21:46 +0000</pubDate>
      <link>https://dev.to/sqlinsix/row-level-security-in-sql-server-4hi6</link>
      <guid>https://dev.to/sqlinsix/row-level-security-in-sql-server-4hi6</guid>
      <description>&lt;p&gt;Row level security (RLS) provides us with a security tool that can limit data access on the row level of data. This security technique provides value in situations where we need to delineate access to data where a “whole” is stored, but only a “part” needs to be accessed by users — such as a product list from all companies where each company can only see its own products.&lt;/p&gt;

&lt;p&gt;While this post walks through an example in SQL Server, row level security is available in SQL Server 2016 and above, Azure SQL, Azure SQL Managed Instance, Azure Synapse and Warehouse in Microsoft Fabric.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Would We Use Row Level Security?
&lt;/h2&gt;

&lt;p&gt;While RLS isn't the only tool that will help us in these situations, it may be one that we find useful. We might consider RLS in the following circumstances:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complying with strict security requirements from our company or legal requirements where access to sensitive data must be filtered by user or role within the company.&lt;/li&gt;
&lt;li&gt;When the data we store on the row level relate directly to the responsibilities of the user or role who views the data. If the data relate to the user or role on a column level, we would consider dynamic data masking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A visual way to think about RLS is a pie chart where each piece of the pie represents a user and each user should only be able to view their piece of the entire pie. Like with dynamic data masking as we’ll see, there are some security concerns with RLS, which is one reason why it’s not high on my recommendation list of security tools.&lt;/p&gt;
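
&lt;p&gt;As a minimal sketch of the mechanics (the linked post has the author's full T-SQL; the schema, table and column names below are hypothetical), RLS in SQL Server pairs an inline table-valued predicate function with a security policy:&lt;/p&gt;

```sql
-- Assumes a Security schema and a dbo.Products table with a CompanyName column
CREATE FUNCTION Security.fn_ProductAccess (@CompanyName AS sysname)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS AccessResult
       WHERE @CompanyName = USER_NAME();
GO

-- Filter dbo.Products so each user sees only their company's rows
CREATE SECURITY POLICY Security.ProductFilter
    ADD FILTER PREDICATE Security.fn_ProductAccess(CompanyName)
    ON dbo.Products
    WITH (STATE = ON);
```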

&lt;h2&gt;
  
  
  Row Level Security Considerations
&lt;/h2&gt;

&lt;p&gt;Like with dynamic data masking, row level security (RLS) invites attacks because the data are present in the database. A hacker knows that if he compromises your database, he can get access to your data. In the case of RLS, the hacker knows that it’s only a matter of permissions. Hackers can use social engineering to determine if a company is using row level security and if this is the case, they know they only have to compromise the database. A more specific example of this (and a form of social engineering) would be to use social media to identify who works at a target company, whether they use RLS, and compromise the person’s access (spearphishing from social media analysis would be another route). Even if we assume the hacker only compromises one employee (one makes it easier to compromise a second and third), that is enough to get the data.&lt;/p&gt;

&lt;p&gt;By contrast, in the case of a security approach like Always Encrypted, where a different layer holds the key to decrypt the data, the hacker must compromise two layers — simply getting the data won't be enough. With both dynamic data masking and row level security, this is not the case.&lt;/p&gt;

&lt;p&gt;If we plan to use RLS, we need to ensure through sufficient testing that it functions as intended. As we've seen, AFTER and BEFORE UPDATE differ in what could happen with users, so we would want detailed testing scenarios around them. We should always remember that we're designing for failure (or in the case of security, compromise). This means that we test our architecture as if we're trying to compromise the security. I highly recommend researching people who've used this feature with their architecture, as they will make note of some of the issues they've seen in their environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sqlinsix.medium.com/row-level-security-deep-dive-in-sql-server-20f62827708c"&gt;Continue reading the entire post with T-SQL examples&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sqlserver</category>
      <category>sql</category>
      <category>data</category>
      <category>security</category>
    </item>
    <item>
      <title>Quick Databricks' Cloning Tip</title>
      <dc:creator>Tim</dc:creator>
      <pubDate>Thu, 18 Apr 2024 10:31:44 +0000</pubDate>
      <link>https://dev.to/sqlinsix/quick-databricks-cloning-tip-1bj9</link>
      <guid>https://dev.to/sqlinsix/quick-databricks-cloning-tip-1bj9</guid>
      <description>&lt;p&gt;Quick Databricks tip:&lt;/p&gt;

&lt;p&gt;Two types of cloning you can use with objects in Databricks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DEEP CLONE&lt;/li&gt;
&lt;li&gt;SHALLOW CLONE&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A deep clone copies both the data and the table metadata, while a shallow clone copies only the metadata and references the source's data files.&lt;/p&gt;
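
&lt;p&gt;As a quick sketch of the syntax (the table names below are hypothetical):&lt;/p&gt;

```sql
-- Copies both the data files and the metadata
CREATE TABLE test.sales_copy DEEP CLONE prod.sales;

-- Copies only the metadata; data files are referenced from the source
CREATE TABLE test.sales_shallow SHALLOW CLONE prod.sales;
```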

&lt;p&gt;Both are extremely useful in different types of testing.&lt;/p&gt;

</description>
      <category>data</category>
      <category>databricks</category>
    </item>
    <item>
      <title>Quick Databricks Meta-Information Tip</title>
      <dc:creator>Tim</dc:creator>
      <pubDate>Mon, 01 Apr 2024 11:14:26 +0000</pubDate>
      <link>https://dev.to/sqlinsix/quick-databricks-meta-information-tip-19db</link>
      <guid>https://dev.to/sqlinsix/quick-databricks-meta-information-tip-19db</guid>
      <description>&lt;p&gt;To get meta information on a table, you can use DESCRIBE EXTENDED or DESCRIBE DETAIL.&lt;/p&gt;

&lt;p&gt;Quick examples:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DESCRIBE EXTENDED MyTable&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Or...&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DESCRIBE DETAIL MyTable&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Each has its own set of details, and they overlap on metadata like creation time, type, etc.&lt;/p&gt;

</description>
      <category>data</category>
      <category>databricks</category>
    </item>
    <item>
      <title>Will AI End All of Us?</title>
      <dc:creator>Tim</dc:creator>
      <pubDate>Sun, 24 Mar 2024 12:56:52 +0000</pubDate>
      <link>https://dev.to/sqlinsix/will-ai-end-all-of-us-2m35</link>
      <guid>https://dev.to/sqlinsix/will-ai-end-all-of-us-2m35</guid>
      <description>&lt;p&gt;Everyone is talking about it.&lt;/p&gt;

&lt;p&gt;It's hyped everywhere now.&lt;/p&gt;

&lt;p&gt;This site is by devs for devs. Is AI the end for all of us?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>development</category>
    </item>
    <item>
      <title>Connecting A Data Factory To An Existing Runtime In Azure</title>
      <dc:creator>Tim</dc:creator>
      <pubDate>Thu, 28 Oct 2021 15:47:37 +0000</pubDate>
      <link>https://dev.to/sqlinsix/connecting-a-data-factory-to-an-existing-runtime-in-azure-1phg</link>
      <guid>https://dev.to/sqlinsix/connecting-a-data-factory-to-an-existing-runtime-in-azure-1phg</guid>
      <description>&lt;p&gt;With Azure Data Factory, we can connect on premise assets to Azure resources, such as an on premise database to an Azure database. At this time, this requires an on premise runtime that communicates with both the on premise resource and the Azure resource. In the video &lt;a href="https://www.youtube.com/watch?v=IpvmNyTLDxE"&gt;Connect Azure Data Factory To Existing Runtime&lt;/a&gt;, we see how to connect an Azure Data Factory to an existing runtime (to set up a runtime on premise, review the video &lt;a href="https://www.youtube.com/watch?v=CA7ML8NRM4U"&gt;How To Setup A Runtime for Azure Data Factory&lt;/a&gt;, as this video assumes that we have a runtime set up on one of our servers that has been linked to a data factory). While not discussed in the video, we should review the security of our design - we want to be extremely careful connecting on premise assets to the cloud. If we will be sending data from an on premise asset, make sure that both the server with the data and the server running the runtime (if different) are isolated.&lt;/p&gt;

&lt;p&gt;Before connecting an Azure Data Factory to an existing runtime, we should make sure that the following are true for our environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We've considered all security angles for running a runtime that connects an on premise asset to Azure's cloud.&lt;/li&gt;
&lt;li&gt;We've looked at the scale ahead of time - being aware of the data load that we expect and how we'll parallelize this across assets, environments, etc.  Keep in mind that the runtime may become the bottleneck, if we're using it for multiple assets/environments/etc.  Part of this analysis is ensuring that we're transferring the least amount of data required to meet our objective.&lt;/li&gt;
&lt;li&gt;We have monitoring in place to discover issues as quickly as possible, as if we have any production dependency on this, an outage could be costly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These prerequisites are a must; without this analysis ahead of time, we may face costly issues later.&lt;/p&gt;

&lt;p&gt;Once we have our runtime set up, we can grant permissions from this runtime to Azure Data Factories. We will want to consider how many data factories we share this with, as well as whether sharing it will comply with our other standards. For example, we should not connect one runtime to our Sandbox-QA-Preproduction-Production environments for multiple reasons - scaling, testing and security, to name a few. The only exception would be if we were demonstrating what NOT to do. The same logic applies if we're connecting a runtime to multiple assets within an environment: we want to consider scale early, as it will be more costly to make adjustments later.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>T-SQL: How To UNION ALL Tables and Why</title>
      <dc:creator>Tim</dc:creator>
      <pubDate>Thu, 07 Oct 2021 08:39:04 +0000</pubDate>
      <link>https://dev.to/sqlinsix/t-sql-how-to-union-all-tables-and-why-996</link>
      <guid>https://dev.to/sqlinsix/t-sql-how-to-union-all-tables-and-why-996</guid>
      <description>&lt;p&gt;In some cases, we need to combine the results of two tables where data types may be similar or identical. We could combine the data in one table using INSERT syntax, but with UNION ALL or UNION, we can achieve the same result without creating a new table. In the video &lt;a href="https://www.youtube.com/watch?v=c9kz_DhyWT8&amp;amp;list=PLHHm85xMQAYNScEs-nu9gzi2uoQ9JCCYA&amp;amp;index=12"&gt;SQL Basics: How To Use UNION ALL and Why&lt;/a&gt;, we see several examples of this in action. In one example, we union tables with the same data types, and in another, we union two columns that have different data types.&lt;/p&gt;

&lt;p&gt;Some questions that are answered in the video:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Note the tables that we're using, which values are identical based on the column names and which values differ. We'll be using these tables throughout these videos.&lt;/li&gt;
&lt;li&gt;What is a like data type? Why does it matter in a union? What's the outcome when we union two like data types that slightly differ?&lt;/li&gt;
&lt;li&gt;Compare the results of the queries when we union identical data types versus when we combine two different data types.&lt;/li&gt;
&lt;li&gt;If we have a type of varchar and a type of integer, what tool could we use to combine the columns in a union?&lt;/li&gt;
&lt;li&gt;What should we know about union and security?&lt;/li&gt;
&lt;li&gt;While some object-oriented tools may automatically convert data types, when we have mismatches we should be aware of the underlying data before we use a union.&lt;/li&gt;
&lt;/ul&gt;
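
&lt;p&gt;As a minimal sketch of the mismatch case (the table and column names are hypothetical, not from the video): if one table stores a code as an integer and another as a varchar, a conversion such as CAST aligns the types before the union:&lt;/p&gt;

```sql
-- OrderCode is an INT in OrdersA and a VARCHAR in OrdersB (hypothetical tables)
SELECT OrderId, CAST(OrderCode AS VARCHAR(10)) AS OrderCode
FROM dbo.OrdersA
UNION ALL
SELECT OrderId, OrderCode
FROM dbo.OrdersB;
```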

&lt;p&gt;One applied example of using unions is error logs. Often we have error logs for the database, the application and other layers. It's useful to identify when an error happened and how that error translated across layers. Unioning the error tables on the error and date columns can be useful when we order by the time of the error. We can often find out where the error originated and how it impacted all the layers of our application or service.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Solving Data Differentials With LEFT JOINs</title>
      <dc:creator>Tim</dc:creator>
      <pubDate>Fri, 01 Oct 2021 11:44:21 +0000</pubDate>
      <link>https://dev.to/sqlinsix/solving-data-differentials-with-left-joins-58i3</link>
      <guid>https://dev.to/sqlinsix/solving-data-differentials-with-left-joins-58i3</guid>
      <description>&lt;p&gt;One challenge we face with data involves new and existing data from a source data set to a destination data set. The source and destination may both be managed by us, or only one of them may be. When we have these data sets, we will often get "updates" to these data - in some cases, these updates will involve new data that don't already exist or may involve existing data that need to be updated. In rarer (but still frequent) cases, we may need to remove data that no longer exist in an update. This latter case is rarer because most "differentials" involve adding new data and updating existing data. For the benefit of the "delete" scenario, I also cover it in the video, &lt;a href="https://www.youtube.com/watch?v=L0B_4hULlso&amp;amp;list=PLHHm85xMQAYNScEs-nu9gzi2uoQ9JCCYA&amp;amp;index=16"&gt;Using LEFT JOINs To Solve New and Existing Data Differentials&lt;/a&gt;, along with updating and adding new data. This means that this video shows how we can use LEFT JOINs to solve all three of these scenarios. These scenarios come up frequently in business so it's useful to know how we can use a join type to solve them. Keep in mind, this is a tutorial video to show you how this can be done, not a recommendation of whether you should use this or not.&lt;/p&gt;

&lt;p&gt;Some discussion points mentioned in the video:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Relative to what's required for our differentials, why does the order matter? I show this in the demo and mention this warning early, as something you want to consider. Using the demo, imagine if I re-ordered what operation happened first. What would happen to our data?&lt;/li&gt;
&lt;li&gt;Assume we need to delete, update and insert. What should happen first? Why?&lt;/li&gt;
&lt;li&gt;When it comes to data differentials, which of the 3 operations occurs less frequently? What is an example I use of this type of data differential?&lt;/li&gt;
&lt;li&gt;Based on the sample data set, what should occur with the final outcome of our data? Why should we run a test first?&lt;/li&gt;
&lt;li&gt;Why do I update before I insert?&lt;/li&gt;
&lt;li&gt;T-SQL is a set-based language. When it comes to performance and adding new data or updating existing data, how should we consider the fact that it's a set-based language in our operations?&lt;/li&gt;
&lt;/ul&gt;
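
&lt;p&gt;As a sketch of the three operations in the delete, update, insert order (the table and column names are hypothetical, and the video's exact approach may differ):&lt;/p&gt;

```sql
-- 1. Delete destination rows that no longer exist in the source
DELETE d
FROM dbo.Destination d
LEFT JOIN dbo.Source s ON s.Id = d.Id
WHERE s.Id IS NULL;

-- 2. Update rows that exist in both
UPDATE d
SET d.Amount = s.Amount
FROM dbo.Destination d
INNER JOIN dbo.Source s ON s.Id = d.Id;

-- 3. Insert source rows missing from the destination
INSERT INTO dbo.Destination (Id, Amount)
SELECT s.Id, s.Amount
FROM dbo.Source s
LEFT JOIN dbo.Destination d ON d.Id = s.Id
WHERE d.Id IS NULL;
```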

&lt;p&gt;When we look at solving data differentials with LEFT JOINs, the above points and discussions are worth considering. If we don't order the operations correctly, for example, we may end up with incorrect data or update records that were just inserted, hurting our performance.&lt;/p&gt;

&lt;p&gt;Can we only solve data differentials with LEFT JOINs? No. We can use merge operations, match data on the application layer, use the EXCEPT functionality supported in some SQL dialects, etc. In this tutorial, we see the capability of LEFT JOINs, but they are not the only tool we can use to solve this problem. When we consider solving this problem, we want to weigh two key points in development: performance and maintainability. Not every developer may be familiar with the tools we use to solve this problem, so we'll want to consider how our solution is maintained. Likewise, we want to consider performance. Relative to our architecture, LEFT JOINs may not be the optimal solution. Unfortunately, there is no hard rule about how to solve this problem. But since we see how we can use LEFT JOINs to solve it, we have another application of LEFT JOINs and where they may be appropriate.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Transaction Log Becoming Full Due To Replication</title>
      <dc:creator>Tim</dc:creator>
      <pubDate>Sun, 26 Sep 2021 10:45:27 +0000</pubDate>
      <link>https://dev.to/sqlinsix/transaction-log-becoming-full-due-to-replication-1l7o</link>
      <guid>https://dev.to/sqlinsix/transaction-log-becoming-full-due-to-replication-1l7o</guid>
      <description>&lt;p&gt;In the video, &lt;a href="https://www.youtube.com/watch?v=kCKkrYZq1l0"&gt;The transaction log for database 'DB' is full due to 'REPLICATION'&lt;/a&gt;, we look at this error and what it means, along with some possible solutions (architecture dependent). We do not want to develop the habit of micro-managing transaction logs, so we should consider sizing before setting up any architecture such as replication, ETL solutions, etc. As a note that's not mentioned in the video: transactional replication holds log records until the log reader agent has processed them, so switching recovery models will not free the log on its own. Thus, we must be in a situation where we can make log adjustments if we cannot disable replication.&lt;/p&gt;

&lt;p&gt;Consider these in the context of the video:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is an activity that we should avoid with the log?&lt;/li&gt;
&lt;li&gt;Why is development around the log of a database so important?&lt;/li&gt;
&lt;li&gt;Considering that this is an error we may face at some point, what are some architecture considerations regarding the log when we're replicating data?&lt;/li&gt;
&lt;/ul&gt;
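
&lt;p&gt;One starting point when this error appears is to check what is holding the log (a sketch; the database name 'DB' mirrors the error message and is hypothetical):&lt;/p&gt;

```sql
-- Why can't the log truncate? REPLICATION will appear here if that's the cause
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = 'DB';

-- Current log size and usage for the database we're connected to
SELECT total_log_size_in_bytes / 1048576.0 AS TotalLogMB,
       used_log_space_in_bytes / 1048576.0 AS UsedLogMB,
       used_log_space_in_percent AS UsedLogPercent
FROM sys.dm_db_log_space_usage;
```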

&lt;p&gt;In general, everything in a database revolves around the log. For this reason, we want to plan for log growth from when we first create our database. If we do not make these plans appropriately, we'll run into a myriad of problems involving the log. Our log growth will also dictate the type of transactions that we allow. For this reason (and others), I would heavily monitor log growth for any database server where replication or availability groups are set up, at minimum. Monitoring log growth may be a requirement across the board in some contexts, but it's especially true when we have availability groups and replication set up.&lt;/p&gt;

&lt;p&gt;One regular point I like to remind my audience of, especially recently: we can often solve the same problem with a variety of techniques. While we look at one or two ways to solve the problem, these aren't the only ways to approach it. The most appropriate solution to a problem is one that you understand and can troubleshoot quickly in the future. In this case, prevention is more powerful than solving a problem at the last minute: the last-minute solution comes with more costs (inconvenience). Be careful about applying solutions that solve a problem but introduce new problems in the future.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>sql</category>
      <category>database</category>
    </item>
  </channel>
</rss>
