<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rajat Venkatesh</title>
    <description>The latest articles on DEV Community by Rajat Venkatesh (@vrajat).</description>
    <link>https://dev.to/vrajat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F324905%2F051cab66-b51d-4f88-b9b9-e9f238868593.png</url>
      <title>DEV Community: Rajat Venkatesh</title>
      <link>https://dev.to/vrajat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vrajat"/>
    <language>en</language>
    <item>
      <title>Database Access Stack in Rust</title>
      <dc:creator>Rajat Venkatesh</dc:creator>
      <pubDate>Mon, 23 Oct 2023 12:43:44 +0000</pubDate>
      <link>https://dev.to/vrajat/database-access-stack-in-rust-3k47</link>
      <guid>https://dev.to/vrajat/database-access-stack-in-rust-3k47</guid>
      <description>&lt;p&gt;I am learning to access databases (specifically SQLite and Postgres) in Rust. Database access requires a stack of libraries&lt;br&gt;
that consist of the following layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A database access API specification&lt;/strong&gt;. These specifications help library developers to provide a
consistent API to access databases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-level database access libraries&lt;/strong&gt; built on top of the specification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object-Relational-Mapper (ORM)&lt;/strong&gt; to map tables to classes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration Assistants&lt;/strong&gt; to manage changes to the database schema.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are other types of libraries, such as connection pools. However, the layers mentioned above are the most common.&lt;/p&gt;

&lt;p&gt;A project that uses a database will use libraries for each layer in the stack. Example stacks from&lt;br&gt;
three popular programming languages are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;Java&lt;/th&gt;
&lt;th&gt;Golang&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Specification&lt;/td&gt;
&lt;td&gt;&lt;a href="https://peps.python.org/pep-0249/"&gt;Python DB API&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://docs.oracle.com/javase/8/docs/technotes/guides/jdbc/"&gt;JDBC&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pkg.go.dev/database/sql"&gt;&lt;code&gt;database/sql&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB Access Libraries&lt;/td&gt;
&lt;td&gt;&lt;a href="https://docs.python.org/3/library/sqlite3.html"&gt;sqlite3&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/xerial/sqlite-jdbc"&gt;sqlite-jdbc&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pkg.go.dev/github.com/mattn/go-sqlite3"&gt;sqlite3&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ORM&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.sqlalchemy.org"&gt;SqlAlchemy&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://hibernate.org/orm/"&gt;Hibernate&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://gorm.io/index.html"&gt;GORM&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Migration&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pypi.org/project/alembic/"&gt;Alembic&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://flywaydb.org"&gt;Flyway&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pkg.go.dev/github.com/golang-migrate/migrate/v4"&gt;golang-migrate&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
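&lt;p&gt;As a small sketch of the Python column in the table above: the standard library &lt;code&gt;sqlite3&lt;/code&gt; driver implements the PEP 249 specification, so the &lt;code&gt;connect&lt;/code&gt;/&lt;code&gt;cursor&lt;/code&gt;/&lt;code&gt;execute&lt;/code&gt; surface looks the same for any compliant driver (the table and data here are made up for illustration):&lt;/p&gt;

```python
# Minimal PEP 249 (Python DB API) usage with the stdlib sqlite3 driver.
# Any compliant driver exposes the same connect()/cursor()/execute()/fetchall() shape.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
# Parameter binding via the driver's placeholder style (qmark for sqlite3).
cur.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
conn.commit()
cur.execute("SELECT name FROM users")
rows = cur.fetchall()
print(rows)  # [('alice',)]
conn.close()
```

&lt;p&gt;Because the specification fixes this shape, an ORM like SqlAlchemy can sit on top of any PEP 249 driver without per-driver code, which is exactly what the lack of a specification makes harder in Rust.&lt;/p&gt;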
&lt;h2&gt;
  
  
  Database Access Stack in Rust
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Database Access API Specification
&lt;/h3&gt;

&lt;p&gt;Rust does not yet have an API specification. While not a specification, the closest&lt;br&gt;
alternative is &lt;a href="https://github.com/launchbadge/sqlx"&gt;SQLx&lt;/a&gt;,&lt;br&gt;
which re-implements database access to SQLite, Postgres and MySQL behind the same interface.&lt;/p&gt;

&lt;p&gt;The lack of a specification adds a burden for libraries that have to work with different types of&lt;br&gt;
databases. For example, in Java an ORM is initialized with a JDBC driver. The ORM can use classes&lt;br&gt;
like Connection and Statement from the JDBC driver and assume that the APIs and behaviour are consistent&lt;br&gt;
across JDBC drivers.&lt;/p&gt;

&lt;p&gt;This is not the case in Rust. Consider the native drivers for &lt;a href="https://docs.rs/postgres/0.19.7/postgres/"&gt;Postgres&lt;/a&gt;&lt;br&gt;
and &lt;a href="https://docs.rs/mysql/latest/mysql/"&gt;MySQL&lt;/a&gt;. Both drivers have a &lt;code&gt;Statement&lt;/code&gt; struct.&lt;/p&gt;

&lt;p&gt;Struct &lt;a href="https://docs.rs/mysql/latest/mysql/struct.Statement.html"&gt;&lt;code&gt;mysql::Statement&lt;/code&gt;&lt;/a&gt; has the following functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub fn columns(&amp;amp;self) -&amp;gt; &amp;amp;[Column]
pub fn params(&amp;amp;self) -&amp;gt; &amp;amp;[Column]
pub fn id(&amp;amp;self) -&amp;gt; u32
pub fn connection_id(&amp;amp;self) -&amp;gt; u32
pub fn num_params(&amp;amp;self) -&amp;gt; u16
pub fn num_columns(&amp;amp;self) -&amp;gt; u16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Struct &lt;a href="https://docs.rs/postgres/0.19.7/postgres/struct.Statement.html"&gt;&lt;code&gt;postgres::Statement&lt;/code&gt;&lt;/a&gt; has the following functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub fn params(&amp;amp;self) -&amp;gt; &amp;amp;[Type]
pub fn columns(&amp;amp;self) -&amp;gt; &amp;amp;[Column]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;params&lt;/code&gt; returns a different type in each driver, and &lt;code&gt;mysql::Statement&lt;/code&gt; has more member functions.&lt;/p&gt;

&lt;p&gt;Therefore, a library that needs to connect to both MySQL and Postgres has to implement custom code for each one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Database Access Libraries
&lt;/h3&gt;

&lt;p&gt;There are access libraries for all popular databases across the spectrum of OLTP and OLAP databases.&lt;br&gt;
The main concern is that some popular access libraries are of unknown quality.&lt;/p&gt;

&lt;p&gt;The top open source OLTP databases (Postgres, MySQL, and SQLite) have native, well-documented drivers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Object-Relational-Mapper &amp;amp; Migration Assistants
&lt;/h3&gt;

&lt;p&gt;There are two popular ORMs that also support migrations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://diesel.rs"&gt;Diesel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sea-ql.org/SeaORM/"&gt;SeaORM&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is a &lt;a href="https://www.sea-ql.org/SeaORM/docs/internal-design/diesel/"&gt;comparison&lt;/a&gt; of the two&lt;br&gt;
projects by the maintainers of SeaORM.&lt;/p&gt;

&lt;p&gt;For starter projects, both look like good choices. I could not find any information on which&lt;br&gt;
one performs better for web applications that have to scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Starter Stack
&lt;/h2&gt;

&lt;p&gt;I am working on a command line application that has to access either Postgres or SQLite databases.&lt;br&gt;
It runs SQL queries and does not need an ORM. I will be using SQLx as it provides the same interface&lt;br&gt;
to access both databases.&lt;/p&gt;

&lt;p&gt;The lack of a specification does not affect this project. However, it will be difficult if the&lt;br&gt;
project has to be extended to databases such as Snowflake and Presto in the future.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>sql</category>
      <category>programming</category>
      <category>database</category>
    </item>
    <item>
      <title>Two Methods to Scan for PII in Data Warehouses</title>
      <dc:creator>Rajat Venkatesh</dc:creator>
      <pubDate>Mon, 29 Nov 2021 15:06:01 +0000</pubDate>
      <link>https://dev.to/vrajat/two-methods-to-scan-for-pii-in-data-warehouses-4k4o</link>
      <guid>https://dev.to/vrajat/two-methods-to-scan-for-pii-in-data-warehouses-4k4o</guid>
      <description>&lt;p&gt;An important requirement for data privacy and protection is to find and catalog tables and columns that contain PII&lt;br&gt;
or PHI data in a data warehouse. Open source data catalogs like &lt;a href="https://datahubproject.io"&gt;Datahub&lt;/a&gt; and&lt;br&gt;
&lt;a href="https://amundsen.io"&gt;Amundsen&lt;/a&gt; enable cataloging of information in data warehouses. Moreover, tables and columns&lt;br&gt;
can be tagged including PII and type of PII tags.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The missing piece is to scan, detect and tag tables and columns with PII.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This post describes two strategies to&lt;br&gt;
scan and detect PII, and introduces an open source application, &lt;a href="https://dev.to/piicatcher"&gt;PIICatcher&lt;/a&gt;, that can be used to scan&lt;br&gt;
data warehouses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e_U6sl8o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xgffkpvdkir46gsf3267.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e_U6sl8o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xgffkpvdkir46gsf3267.png" alt="PIICatcher Screenshot" width="880" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is PII data?
&lt;/h2&gt;

&lt;p&gt;PII or &lt;em&gt;Personally Identifiable Information&lt;/em&gt; is generally defined as any piece of information that can be used to&lt;br&gt;
identify an individual. Traditionally, information such as SSNs, mailing addresses, email addresses, or phone numbers is considered PII. As&lt;br&gt;
technology has evolved, the scope of PII has grown to include login IDs, IP addresses, geolocation, and biometric data.&lt;/p&gt;

&lt;p&gt;There are different types of PII data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sensitive&lt;/strong&gt;: any data that can be used to directly link to an individual, such as name, phone number, email, and
mailing address.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-Sensitive&lt;/strong&gt;: any data that can be used to indirectly link to an individual, such as location and race.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Specifically, PII as defined by compliance laws is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GDPR&lt;/strong&gt;: PII is any data that can be used to clearly identify an individual. This also includes IP addresses,
login ID details, social media posts, digital images, geolocation and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CCPA&lt;/strong&gt;: Personal information is defined as information that identifies, relates to, describes, is reasonably capable
of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HIPAA&lt;/strong&gt;: HIPAA also defines PII as any type of information that relates directly or indirectly to an individual.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond the above definitions, domains and businesses may collect PII specific to them. A simple example is&lt;br&gt;
PHI (Personal Health Information) collected by the health industry. Similarly, bank account or crypto-currency wallet IDs&lt;br&gt;
can also be used to identify individuals.&lt;/p&gt;

&lt;p&gt;The following list can be considered as basic or common PII information that all industries need to manage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phone&lt;/li&gt;
&lt;li&gt;Email&lt;/li&gt;
&lt;li&gt;Credit Card&lt;/li&gt;
&lt;li&gt;Address&lt;/li&gt;
&lt;li&gt;Person/Name&lt;/li&gt;
&lt;li&gt;Location&lt;/li&gt;
&lt;li&gt;Date&lt;/li&gt;
&lt;li&gt;Gender&lt;/li&gt;
&lt;li&gt;Nationality&lt;/li&gt;
&lt;li&gt;IP Address&lt;/li&gt;
&lt;li&gt;SSN&lt;/li&gt;
&lt;li&gt;User Name&lt;/li&gt;
&lt;li&gt;Password&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;An example record in the patients table in&lt;br&gt;
&lt;a href="https://synthea.mitre.org/downloads"&gt;Synthetic Patient Records with COVID-19&lt;/a&gt; is:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column Name&lt;/th&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Id&lt;/td&gt;
&lt;td&gt;f0f3bc8d-ef38-49ce-a2bd-dfdda982b271&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BIRTHDATE&lt;/td&gt;
&lt;td&gt;2017-08-24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSN&lt;/td&gt;
&lt;td&gt;999-68-6630&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FIRST&lt;/td&gt;
&lt;td&gt;Jacinto644&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LAST&lt;/td&gt;
&lt;td&gt;Kris249&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RACE&lt;/td&gt;
&lt;td&gt;white&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETHNICITY&lt;/td&gt;
&lt;td&gt;nonhispanic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GENDER&lt;/td&gt;
&lt;td&gt;M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BIRTHPLACE&lt;/td&gt;
&lt;td&gt;Beverly  Massachusetts  US&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ADDRESS&lt;/td&gt;
&lt;td&gt;888 Hickle Ferry Suite 38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CITY&lt;/td&gt;
&lt;td&gt;Springfield&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;STATE&lt;/td&gt;
&lt;td&gt;Massachusetts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;COUNTY&lt;/td&gt;
&lt;td&gt;Hampden County&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZIP&lt;/td&gt;
&lt;td&gt;01106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LAT&lt;/td&gt;
&lt;td&gt;42.151961474963535&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LON&lt;/td&gt;
&lt;td&gt;-72.59895940376188&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HEALTHCARE_EXPENSES&lt;/td&gt;
&lt;td&gt;8446.49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HEALTHCARE_COVERAGE&lt;/td&gt;
&lt;td&gt;1499.08&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note that most of the columns store PII data. However, it can be difficult to detect whether a column stores PII data and&lt;br&gt;
the type of PII. For example, if the scanner only scans the data in &lt;em&gt;SSN&lt;/em&gt;, it may detect it as a phone number.&lt;br&gt;
Similarly, &lt;em&gt;M&lt;/em&gt; or &lt;em&gt;F&lt;/em&gt; in the GENDER column or &lt;em&gt;white&lt;/em&gt; in the RACE column does not provide enough context to detect whether it is PII&lt;br&gt;
and the type of PII. In both these cases, it is easier to scan the column names.&lt;/p&gt;

&lt;p&gt;Conversely, the &lt;em&gt;payers&lt;/em&gt; table stores the name of health insurance companies in the &lt;strong&gt;NAME&lt;/strong&gt; column. In this case, the&lt;br&gt;
scanner has to check the data to detect that the &lt;strong&gt;NAME&lt;/strong&gt; column does &lt;em&gt;not&lt;/em&gt; contain PII data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Techniques to scan and detect PII data
&lt;/h2&gt;

&lt;p&gt;Based on the previous section, the two main strategies to scan for PII data are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scan column and table names&lt;/li&gt;
&lt;li&gt;Scan data stored in columns&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Scan Data Warehouse Metadata
&lt;/h3&gt;

&lt;p&gt;Data engineers use descriptive names for tables and columns to help users understand the data stored in them. Therefore,&lt;br&gt;
the names of tables and columns provide clues to the type of data stored. For example,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;first_name&lt;/em&gt;, &lt;em&gt;last_name&lt;/em&gt;, &lt;em&gt;full_name&lt;/em&gt; or &lt;em&gt;name&lt;/em&gt; may be used to store the name of a person.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;ssn&lt;/em&gt; or &lt;em&gt;social_security&lt;/em&gt; may be used to store US Social Security numbers.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;phone&lt;/em&gt; or &lt;em&gt;phone_number&lt;/em&gt; may be used to store phone numbers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All data warehouses provide an information schema to extract schema, table and column information. For example, the&lt;br&gt;
following query can be used to get metadata from Snowflake:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    lower(c.column_name) AS col_name,
    c.comment AS col_description,
    lower(c.data_type) AS col_type,
    lower(c.ordinal_position) AS col_sort_order,
    lower(c.table_catalog) AS database,
    lower({cluster_source}) AS cluster,
    lower(c.table_schema) AS schema,
    lower(c.table_name) AS name,
    t.comment AS description,
    decode(lower(t.table_type), 'view', 'true', 'false') AS is_view
FROM
    {database}.{schema}.COLUMNS AS c
LEFT JOIN
    {database}.{schema}.TABLES t
        ON c.TABLE_NAME = t.TABLE_NAME
        AND c.TABLE_SCHEMA = t.TABLE_SCHEMA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Regular expressions can be used to match table or column names. For example, the regular expression below detects&lt;br&gt;
a column that stores social security numbers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;^.*(ssn|social).*$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
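&lt;p&gt;A minimal sketch of how such patterns can be applied to column names. The patterns and PII types below are illustrative only, not PIICatcher's actual rule set:&lt;/p&gt;

```python
import re

# Hypothetical per-PII-type patterns for column names (illustrative, not
# PIICatcher's real rules). The first matching pattern wins.
COLUMN_PATTERNS = {
    "ssn": re.compile(r"^.*(ssn|social).*$", re.IGNORECASE),
    "phone": re.compile(r"^.*(phone).*$", re.IGNORECASE),
    "person": re.compile(r"^.*(first_name|last_name|full_name|name).*$", re.IGNORECASE),
}

def detect_pii_type(column_name):
    """Return the first PII type whose pattern matches the column name, or None."""
    for pii_type, pattern in COLUMN_PATTERNS.items():
        if pattern.match(column_name):
            return pii_type
    return None

print(detect_pii_type("SSN"))         # ssn
print(detect_pii_type("home_phone"))  # phone
print(detect_pii_type("zip_code"))    # None
```

&lt;p&gt;Running this over the &lt;code&gt;col_name&lt;/code&gt; column of the information-schema query above is enough for a first pass at metadata scanning.&lt;/p&gt;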
&lt;h3&gt;
  
  
  Scan data stored in columns
&lt;/h3&gt;

&lt;p&gt;The second strategy is to scan the data stored in columns. Within this strategy the two sub-strategies are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regular Expressions&lt;/li&gt;
&lt;li&gt;NLP libraries such as &lt;a href="https://stanfordnlp.github.io/CoreNLP/ner.html"&gt;Stanford NER Detector&lt;/a&gt; and
&lt;a href="https://spacy.io/"&gt;Spacy&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The major disadvantage of this strategy is that NLP libraries are compute-intensive. It can be prohibitively expensive&lt;br&gt;
to run NLP scanners even on moderately sized tables, let alone tables with millions or billions of rows. Therefore, a&lt;br&gt;
&lt;em&gt;random&lt;/em&gt; sample of rows should be scanned. Choosing a random sample is harder than expected. Luckily, a few databases&lt;br&gt;
provide built-in functions to choose a random sample. For example, the Snowflake query below returns a random sample:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select {column_list} from {schema_name}.{table_name} TABLESAMPLE BERNOULLI (10 ROWS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once the rows have been extracted, they can be processed using regular expressions or NLP libraries to detect&lt;br&gt;
PII content.&lt;/p&gt;
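&lt;p&gt;A small illustration of the regular-expression sub-strategy on sampled values. The patterns and the majority-vote threshold are illustrative assumptions; real scanners ship far more careful rules:&lt;/p&gt;

```python
import re

# Illustrative value-level patterns (not a production rule set).
VALUE_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
}

def scan_sample(values, threshold=0.5):
    """Return the PII type matching at least `threshold` of sampled values, or None."""
    for pii_type, pattern in VALUE_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if values and hits / len(values) >= threshold:
            return pii_type
    return None

# Sampled values from the SSN column of the synthetic patients table.
sample = ["999-68-6630", "123-45-6789", "not-an-ssn"]
print(scan_sample(sample))  # ssn
```

&lt;p&gt;The threshold guards against a single dirty row flipping the verdict for a whole column.&lt;/p&gt;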

&lt;h3&gt;
  
  
  Breaking Ties
&lt;/h3&gt;

&lt;p&gt;As explained in the challenges section, both techniques are required to detect PII data. However, both techniques&lt;br&gt;
are susceptible to false positives and negatives. More often than not, different techniques suggest conflicting&lt;br&gt;
PII types. Detecting the right type is hard and is the subject of a future blog post.&lt;/p&gt;

&lt;h2&gt;
  
  
  PIICatcher: Scan data warehouses for PII data
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/piicatcher"&gt;PIICatcher&lt;/a&gt; implements both the strategies to scan and detect PII data in the data warehouses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3ZKSzvjR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bvedb7d0eau2stbed8fw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3ZKSzvjR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bvedb7d0eau2stbed8fw.gif" alt="PIICather Demo" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;A data warehouse can be scanned using either strategy. PIICatcher is &lt;em&gt;batteries-included&lt;/em&gt;, with a growing set of&lt;br&gt;
regular expressions for scanning column names as well as data. It also includes &lt;a href="https://spacy.io"&gt;Spacy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;PIICatcher supports incremental scans and will only scan new or not-yet-scanned columns. Incremental scans make it easy to&lt;br&gt;
schedule scans. It also provides powerful options to include or exclude schemas and tables to manage compute resources.&lt;/p&gt;

&lt;p&gt;There are ingestion functions for both Datahub and Amundsen which will tag columns and tables with PII and the type of&lt;br&gt;
PII tags.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jao69mwA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fh1xhqqgu6adcts5n75e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jao69mwA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fh1xhqqgu6adcts5n75e.png" alt="Amundsen" width="880" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://tokern.io/blog/lake-glue-access-analyzer"&gt;AWS Glue &amp;amp; Lake Formation Privilege Analyzer&lt;/a&gt; for an example&lt;br&gt;
of how PIICatcher is used in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Column names and data can be scanned to detect PII in databases. Both strategies are required to reliably detect PII&lt;br&gt;
data. PIICatcher is an open source application that implements both these strategies. It can tag datasets with PII and&lt;br&gt;
the type of PII to enable data admins to make more informed decisions on data privacy and security.&lt;/p&gt;

</description>
      <category>security</category>
      <category>database</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Open Source SQL Parsers</title>
      <dc:creator>Rajat Venkatesh</dc:creator>
      <pubDate>Fri, 08 Oct 2021 10:38:18 +0000</pubDate>
      <link>https://dev.to/vrajat/open-source-sql-parsers-1lgj</link>
      <guid>https://dev.to/vrajat/open-source-sql-parsers-1lgj</guid>
      <description>&lt;p&gt;Parsing SQL queries provides superpowers for monitoring data health. This post describes how to get started on parsing &lt;br&gt;
SQL for data observability. &lt;/p&gt;

&lt;p&gt;Query history of a data warehouse is a rich source of information to glean how data is used in your organization.&lt;br&gt;
Many aspects of data observability can be tracked by analyzing query history. For example, query history analysis can&lt;br&gt;
extract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Popular tables and columns&lt;/li&gt;
&lt;li&gt;Unused tables and columns&lt;/li&gt;
&lt;li&gt;Column-level lineage&lt;/li&gt;
&lt;li&gt;Freshness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aTjsujqo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i987pvc5sc212z8oazp6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aTjsujqo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i987pvc5sc212z8oazp6.png" alt="SQL Execution Plan"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These statistics also help to automate common data engineering tasks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backup and Disaster Recovery&lt;/li&gt;
&lt;li&gt;Triage Data Quality issues&lt;/li&gt;
&lt;li&gt;Track sensitive data and how they are used.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges and Approaches
&lt;/h2&gt;

&lt;p&gt;The SQL language is an ISO/IEC standard and the latest version is &lt;a href="https://blog.ansi.org/2018/10/sql-standard-iso-iec-9075-2016-ansi-x3-135/"&gt;SQL:2016&lt;/a&gt;.&lt;br&gt;
However, every database implements the standard&lt;br&gt;
differently, uses different function names for the same operation, and has extensions to access specific custom features.&lt;br&gt;
Therefore, there isn’t one SQL parser for the dialects of all popular databases and data warehouses.&lt;/p&gt;

&lt;p&gt;Regular expressions are a popular approach to extracting information from SQL statements. However, regular expressions quickly&lt;br&gt;
become too complex to handle common features like WITH clauses, sub-queries, window clauses, aliases, and quoting.&lt;br&gt;
&lt;a href="https://github.com/andialbrecht/sqlparse"&gt;sqlparse&lt;/a&gt; is a popular Python package that uses regular expressions to parse&lt;br&gt;
SQL.&lt;/p&gt;
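&lt;p&gt;A toy example of why regular expressions break down (a sketch, not how sqlparse works internally): a naive FROM/JOIN matcher handles simple statements but misreports a WITH clause's CTE name as a table:&lt;/p&gt;

```python
import re

# Naive regex extraction of table names after FROM/JOIN. It works for
# simple statements but mishandles CTEs, sub-queries, and quoted
# identifiers, which is why real tools use grammar-based parsers.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def tables(sql):
    """Return identifiers that follow FROM or JOIN keywords."""
    return TABLE_RE.findall(sql)

print(tables("SELECT a FROM orders JOIN customers ON o.id = c.id"))
# ['orders', 'customers']

# Breaks down: the CTE name 'recent' is reported as if it were a table.
print(tables("WITH recent AS (SELECT * FROM orders) SELECT * FROM recent"))
# ['orders', 'recent']
```

&lt;p&gt;Handling the WITH case correctly requires tracking scopes, which is exactly where a grammar-based parser earns its keep.&lt;/p&gt;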

&lt;p&gt;An alternate approach is to implement the SQL grammar using parser generators like &lt;a href="https://antlr.org"&gt;ANTLR&lt;/a&gt;. There&lt;br&gt;
are similar open source parser generators in other popular languages.&lt;/p&gt;

&lt;p&gt;There are multiple projects that maintain parsers for popular open source databases like MySQL and Postgres. For other&lt;br&gt;
open source databases, the grammar can be extracted from the open-source project. For commercial databases, the only&lt;br&gt;
option is to reverse engineer the complete grammar. There are SQL parser/optimizer platforms like &lt;a href="https://calcite.apache.org/"&gt;Apache Calcite&lt;/a&gt;&lt;br&gt;
that help to reduce the effort to implement the SQL dialect of your choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source Parsers
&lt;/h2&gt;

&lt;p&gt;Parsers for some popular open source databases and data warehouses are:&lt;/p&gt;

&lt;h3&gt;
  
  
  MySQL/MariaDB
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/pingcap/parser"&gt;Pingcap&lt;/a&gt; parser is a MySQL parser in Go.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/phpmyadmin/sql-parser"&gt;SQL Parser&lt;/a&gt; in phpmyadmin is a validating SQL lexer and parser with a focus on MySQL dialect.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Postgres
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/pganalyze/libpg_query"&gt;libpg_query&lt;/a&gt; extracts the parser (written in C) from the postgres project and&lt;br&gt;
packages it as a stand-alone library. This library is wrapped in other languages by other projects like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Python&lt;/em&gt;: &lt;a href="https://github.com/lelit/pglast"&gt;pglast&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Ruby&lt;/em&gt; : &lt;a href="https://github.com/pganalyze/pg_query"&gt;pg_query&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Golang&lt;/em&gt;: &lt;a href="https://github.com/pganalyze/pg_query_go"&gt;pg_query_go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;JS&lt;/em&gt;: &lt;a href="https://github.com/pyramation/pgsql-parser"&gt;pgsql-parser&lt;/a&gt; in Node and &lt;a href="https://github.com/pganalyze/pg-query-emscripten"&gt;pg-query-emscripten&lt;/a&gt; in the browser&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Rust&lt;/em&gt;: &lt;a href="https://github.com/paupino/pg_query.rs"&gt;pg_query.rs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multiple Engines
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/uber/queryparser"&gt;queryparser&lt;/a&gt; implements Apache Hive, Presto/Trino and Vertica dialects.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/google/zetasql"&gt;zetasql&lt;/a&gt; implements BigQuery, Spanner, and Dataflow dialects.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Generic Parsers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Python&lt;/em&gt;: &lt;a href="https://github.com/andialbrecht/sqlparse"&gt;sqlparse&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Rust&lt;/em&gt;: &lt;a href="https://github.com/sqlparser-rs/sqlparser-rs"&gt;sqlparser-rs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Python&lt;/em&gt;: &lt;a href="https://github.com/klahnakoski/mo-sql-parsing"&gt;mo-sql-parseing&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Platforms
&lt;/h2&gt;

&lt;p&gt;Parser/optimizer platforms implement common SQL language features and allow customization as a first-class feature&lt;br&gt;
of the platform. Two popular open source projects are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://calcite.apache.org/"&gt;Apache Calcite&lt;/a&gt; is a popular parser/optimizer that is used in popular databases and query engines
like &lt;a href="https://hive.apache.org"&gt;Apache Hive&lt;/a&gt;, &lt;a href="https://www.blazingsql.com/"&gt;BlazingSQL&lt;/a&gt; and &lt;a href="https://calcite.apache.org/docs/powered_by.html"&gt;many others&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/JSQLParser/JSqlParser"&gt;JSQLParser&lt;/a&gt; can parse multiple SQL dialects like MySQL, Postgres and Oracle.
The grammar can be modified to support other SQL dialects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apache Calcite allows customization at various points of the parsing process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parser rules can be &lt;a href="https://stackoverflow.com/a/44467850"&gt;changed&lt;/a&gt; to support custom syntax.&lt;/li&gt;
&lt;li&gt;Conventions such as single vs. double quotes and case sensitivity can be configured.&lt;/li&gt;
&lt;li&gt;Optimizer rules can be added.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apache Calcite also provides visitors for traversing the SQL execution plan; the visitor pattern is a convenient way to walk&lt;br&gt;
a SQL plan node by node.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Tips for Getting Started
&lt;/h2&gt;

&lt;p&gt;There are many abandoned open source SQL parsers, so the first filter is to choose a project that is actively maintained&lt;br&gt;
and will be supported in the future. For popular databases such as Postgres and MySQL/MariaDB, parsers are available in multiple programming&lt;br&gt;
languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if there is no parser for your database?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most teams do not create a parser from scratch. A popular option is to start with the Postgres parser and add custom&lt;br&gt;
SQL syntax; AWS Redshift, Vertica, and DuckDB are examples of Postgres-derived dialects. A Postgres SQL parser can handle&lt;br&gt;
the majority of queries in the query history of these databases.&lt;/p&gt;

&lt;p&gt;Some queries, such as those using UNLOAD in AWS Redshift, will still fail to parse. If it is important to parse these variants,&lt;br&gt;
consider modifying the parser project to accept the custom grammar or using a platform like Apache Calcite.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There is a demand for SQL parsers to build reports on database or data warehouse usage. There are a number of good&lt;br&gt;
open-source projects. However, there is a steep learning curve to use these projects and in many cases a project may not&lt;br&gt;
fit your specific requirements.&lt;/p&gt;

&lt;p&gt;Struggling with parsing query history? &lt;a href="https://rajatvenkatesh.typeform.com/to/VtzHo2qe"&gt;Get in touch&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sql</category>
      <category>opensource</category>
      <category>database</category>
    </item>
    <item>
      <title>Data Governance 101</title>
      <dc:creator>Rajat Venkatesh</dc:creator>
      <pubDate>Fri, 04 Sep 2020 10:08:09 +0000</pubDate>
      <link>https://dev.to/vrajat/data-governance-101-320</link>
      <guid>https://dev.to/vrajat/data-governance-101-320</guid>
      <description>&lt;p&gt;&lt;em&gt;This post is a recap of my &lt;a href="https://tokern.io/decks/data-governance-101/slides" rel="noopener noreferrer"&gt;presentation&lt;/a&gt; and is a semi-autobiographical journey in helping data teams setup data governance frameworks.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Data Governance?
&lt;/h1&gt;

&lt;p&gt;The first step is to understand what data governance is. Data Governance is an overloaded term and means different things to different people. It has been helpful to define Data Governance based on the outcomes it is supposed to deliver. In my case, Data Governance is any task required for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compliance: The data life cycle and usage comply with laws and regulations.&lt;/li&gt;
&lt;li&gt;Privacy: Data is protected as per regulations and user expectations.&lt;/li&gt;
&lt;li&gt;Security: Data &amp;amp; data infrastructure are adequately protected.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Why is Data Governance hard?
&lt;/h1&gt;

&lt;p&gt;Compliance, Privacy, and Security are different approaches to ensure that data collectors and processors do not gain unregulated insights. It is hard to ensure that the right data governance framework is in place to meet this goal. An &lt;a href="https://gawker.com/the-public-nyc-taxicab-database-that-accidentally-track-1646724546" rel="noopener noreferrer"&gt;interesting example of an unexpected insight&lt;/a&gt; is the sequence of events that led to the leak of celebrities' taxi cab tipping history.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F55jdtb1nlb86w4tmzavp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F55jdtb1nlb86w4tmzavp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paparazzi took photos of celebrities in New York City using taxi cabs. The photos had geo-locations and timestamps along with identifying information about taxi cabs like registration and medallion numbers. Independently, the Taxi Commission released an anonymized dataset of taxi trips with time, medallion numbers, fares, and tips. It was possible to link the metadata from photographs and the taxi usage dataset to get the tips given by celebrities.&lt;/p&gt;
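&lt;p&gt;The linkage is effectively a database join on shared quasi-identifiers. A minimal sketch with made-up records (all values hypothetical):&lt;/p&gt;

```python
# Sketch of a linkage attack: join photo metadata with an "anonymized"
# trip dataset on the quasi-identifiers both datasets share.
photos = [
    {"celebrity": "Celebrity A", "medallion": "1A23",
     "pickup_time": "2013-07-08 11:00"},
]
trips = [
    {"medallion": "1A23", "pickup_time": "2013-07-08 11:00",
     "fare": 11.5, "tip": 0.0},
    {"medallion": "9Z99", "pickup_time": "2013-07-08 12:30",
     "fare": 8.0, "tip": 2.0},
]

def link(photos, trips):
    # Index trips by (medallion, pickup_time), then look up each photo.
    index = {(t["medallion"], t["pickup_time"]): t for t in trips}
    linked = []
    for p in photos:
        trip = index.get((p["medallion"], p["pickup_time"]))
        if trip:
            linked.append({"celebrity": p["celebrity"], "tip": trip["tip"]})
    return linked
```

&lt;p&gt;Neither dataset alone reveals the tip; the join over shared attributes does.&lt;/p&gt;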

&lt;p&gt;Data Governance is hard because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is too much data&lt;/li&gt;
&lt;li&gt;There is too much complexity in data infrastructure.&lt;/li&gt;
&lt;li&gt;There is no context for data usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  There is too much data
&lt;/h2&gt;

&lt;p&gt;The trend is towards businesses collecting more data from users and sharing more data with each other. For example, the image below lists some of the &lt;a href="https://knowledge.wharton.upenn.edu/article/data-shared-sold-whats-done/" rel="noopener noreferrer"&gt;companies with which PayPal has data sharing agreements&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffjkp69n76kuomda41ghd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffjkp69n76kuomda41ghd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As companies share and hoard more data, it is possible that they will link these datasets to garner insights that were unexpected by the user.&lt;/p&gt;

&lt;h2&gt;
  
  
  There is too much complexity
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdnnhjxtp33zeygr21rz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdnnhjxtp33zeygr21rz8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://mattturck.com/data2019/" rel="noopener noreferrer"&gt;Data &amp;amp; AI Landscape&lt;/a&gt; lists approximately 1500 open-source and commercial technologies. In a small survey, I found that a simple data infrastructure uses 8-10 components. Data and security teams have to ensure similar capabilities in compliance and security across all the parts of the data infrastructure. This is very hard to accomplish.&lt;/p&gt;

&lt;h2&gt;
  
  
  There is no context for data usage
&lt;/h2&gt;

&lt;p&gt;Analytics, Data Science and AI objectives compete with compliance, privacy, and security. Blanket “Yes” or “No” access policies do not work. More context is required to enforce access policies appropriately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who is using the data?&lt;/li&gt;
&lt;li&gt;For what purpose?&lt;/li&gt;
&lt;li&gt;When?&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  How to get started on Data Governance?
&lt;/h1&gt;

&lt;p&gt;When working with teams on Data Governance, I have found it helpful to start by answering these basic questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where is my data?&lt;/li&gt;
&lt;li&gt;Who has access to data?&lt;/li&gt;
&lt;li&gt;How is the data used?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typically, teams care only about sensitive data. Every company and team will have a different definition of what is sensitive. Common examples are PII, PHI, and financial data.&lt;/p&gt;

&lt;p&gt;It is also important to ensure the process of obtaining these answers is automated. Automation ensures that the data governance framework stays relevant and useful when required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where is my sensitive data?
&lt;/h2&gt;

&lt;p&gt;A data catalog, scanner, and data lineage application are required to keep track of sensitive data.&lt;/p&gt;

&lt;p&gt;An example of a data catalog and scanner is PIICatcher. PIICatcher can scan databases and detect PII data, and it can be extended to detect other types of sensitive data. The image below shows the metadata stored in the AWS Glue Catalog by PIICatcher after scanning data in AWS S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fa3okyhut3srxb1wdlivz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fa3okyhut3srxb1wdlivz.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
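&lt;p&gt;The core of such a scanner can be illustrated in a few lines. This is a hedged sketch of regex-based scanning, not PIICatcher's actual API:&lt;/p&gt;

```python
import re

# Illustrative detectors; a real scanner like PIICatcher ships many more
# and typically combines column-name heuristics with data sampling.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan_column(values):
    """Return the set of PII types detected in a sample of column values."""
    found = set()
    for value in values:
        for pii_type, pattern in PII_PATTERNS.items():
            if pattern.search(str(value)):
                found.add(pii_type)
    return found
```

&lt;p&gt;A catalog then stores the detected tags alongside the table and column metadata.&lt;/p&gt;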

&lt;p&gt;Typically, it is not practical to scan all datasets. Instead, it is sufficient to scan the base datasets and then build a data lineage graph to track sensitive data. A library like data-lineage can build a DAG from query history using a graphing library. The DAG can be used to visualize the lineage or to process it programmatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmfx467ks3lrvzpqamxac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmfx467ks3lrvzpqamxac.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
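&lt;p&gt;The core idea can be sketched with plain regular expressions over hypothetical INSERT statements; the real data-lineage library uses a proper SQL parser and the networkx graph library instead:&lt;/p&gt;

```python
import re
from collections import defaultdict

INSERT_RE = re.compile(r"INSERT\s+INTO\s+(\w+)", re.I)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+(\w+)", re.I)

def build_dag(query_history):
    """Map each source table to the set of tables loaded from it."""
    dag = defaultdict(set)
    for sql in query_history:
        target = INSERT_RE.search(sql)
        if target:
            for source in SOURCE_RE.findall(sql):
                dag[source].add(target.group(1))
    return dag

def downstream(dag, table):
    """All tables that (transitively) receive data from `table`."""
    seen, stack = set(), [table]
    while stack:
        for nxt in dag.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Hypothetical query history.
history = [
    "INSERT INTO daily_users SELECT id FROM users",
    "INSERT INTO report SELECT d.id FROM daily_users d JOIN orders o",
]
dag = build_dag(history)
```

&lt;p&gt;Tagging &lt;code&gt;users&lt;/code&gt; as sensitive then automatically flags every table downstream of it.&lt;/p&gt;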

&lt;h2&gt;
  
  
  Who has access to my sensitive data?
&lt;/h2&gt;

&lt;p&gt;Most databases have an information schema that stores the privileges of users and roles. These tables can be joined with a data catalog, in which columns with sensitive data are tagged, to get a list of users and roles that have access to sensitive data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh688ite6j8rh0eyjleav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh688ite6j8rh0eyjleav.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
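&lt;p&gt;The join can be sketched in SQLite, with made-up tables standing in for the database's privilege views and the catalog's tags:&lt;/p&gt;

```python
import sqlite3

# Stand-ins: `column_privileges` mimics an information schema view,
# `catalog_tags` mimics a data catalog where sensitive columns are tagged.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE column_privileges (grantee TEXT, table_name TEXT, column_name TEXT);
CREATE TABLE catalog_tags (table_name TEXT, column_name TEXT, tag TEXT);
INSERT INTO column_privileges VALUES
  ('analyst', 'users', 'email'),
  ('analyst', 'orders', 'total'),
  ('auditor', 'users', 'email');
INSERT INTO catalog_tags VALUES ('users', 'email', 'PII');
""")

# Who can read columns tagged as PII?
rows = con.execute("""
SELECT DISTINCT p.grantee, p.table_name, p.column_name
FROM column_privileges p
JOIN catalog_tags t
  ON p.table_name = t.table_name AND p.column_name = t.column_name
WHERE t.tag = 'PII'
ORDER BY p.grantee
""").fetchall()
```

&lt;p&gt;The same query shape works against a real information schema once the catalog's tags are exported to a queryable table.&lt;/p&gt;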

&lt;h2&gt;
  
  
  How is sensitive data used?
&lt;/h2&gt;

&lt;p&gt;The first step is to log usage across all databases. Most databases store &lt;a href="https://tokern.io/blog/data-lineage-on-snowflake/" rel="noopener noreferrer"&gt;query history in the information schema&lt;/a&gt;. Big Data technologies like Presto provide hooks to capture usage. It is not advisable to log usage directly from production databases. Instead, proxies should be used for human access. &lt;a href="https://tokern.io/blog/proxysql-database-audit/" rel="noopener noreferrer"&gt;Proxies&lt;/a&gt; can log usage and send the information to a log aggregator where it can be analyzed.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Data Compliance, Privacy &amp;amp; Security is a journey. Data governance is hard, but it can be tackled by starting with simple questions and using automation extensively.&lt;/p&gt;

</description>
      <category>database</category>
      <category>datagovernance</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>FAQs on Data Lineage</title>
      <dc:creator>Rajat Venkatesh</dc:creator>
      <pubDate>Wed, 22 Apr 2020 12:35:16 +0000</pubDate>
      <link>https://dev.to/vrajat/faqs-on-data-lineage-akp</link>
      <guid>https://dev.to/vrajat/faqs-on-data-lineage-akp</guid>
      <description>&lt;h2&gt;
  
  
  What is meant by data lineage?
&lt;/h2&gt;

&lt;p&gt;In Biology, lineage is a sequence of species each of which is considered to have evolved from its predecessor.&lt;/p&gt;

&lt;p&gt;Similarly, Data Lineage is a sequence of transformations through intermediary systems to a final data set. Each data set&lt;br&gt;
is considered to have been created from its predecessor through a specific transformation. A transformation may be a&lt;br&gt;
SQL query or a program in a language such as Python or Scala. Data Lineage can be at any level of granularity: schema, table&lt;br&gt;
or column.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is data lineage important?
&lt;/h2&gt;

&lt;p&gt;Data Lineage is important because it enables key data governance functions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business Rules Verification&lt;/li&gt;
&lt;li&gt;Change Impact Analysis&lt;/li&gt;
&lt;li&gt;Data Quality Verification&lt;/li&gt;
&lt;/ul&gt;
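&lt;p&gt;Change impact analysis, for example, is a graph traversal over the lineage. A minimal sketch with a made-up lineage graph (all dataset names hypothetical):&lt;/p&gt;

```python
# Edges point from a dataset to the datasets derived from it.
EDGES = [
    ("raw_events", "sessions"),
    ("sessions", "weekly_report"),
    ("sessions", "churn_model"),
]

def impacted(edges, changed):
    """Datasets whose contents may change if `changed` changes."""
    out = set()
    for src, dst in edges:
        if src == changed and dst not in out:
            out.add(dst)
            out |= impacted(edges, dst)
    return out
```

&lt;p&gt;A schema change to &lt;code&gt;raw_events&lt;/code&gt; thus surfaces every downstream report and model to verify before the change ships.&lt;/p&gt;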

&lt;h2&gt;
  
  
  What is a data lineage tool?
&lt;/h2&gt;

&lt;p&gt;A Data Lineage Tool captures metadata of all data transformations, organizes the metadata in a graph and provides access&lt;br&gt;
 to the graph through visual interfaces and programmable APIs.&lt;/p&gt;

&lt;p&gt;In general data lineage tools use two techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Push: ETL platforms push metadata to a data lineage tool during transformations.&lt;/li&gt;
&lt;li&gt;Pull: Data Lineage tools scan logs and query history from databases and data lakes and generate lineage after the event.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some data lineage tools use both techniques.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are there open source data catalog tools?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9"&gt;Amundsen by Lyft&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520"&gt;Metacat by Netflix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://tokern.io/data-lineage/"&gt;Data Lineage&lt;/a&gt; by Tokern.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How do you build a data lineage solution for databases?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Choose one of the open source data catalog projects such as &lt;a href="https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9"&gt;Amundsen&lt;/a&gt;, &lt;a href="https://atlas.apache.org/"&gt;Apache Atlas&lt;/a&gt; or &lt;a href="https://tokern.io/data-lineage/"&gt;Data Lineage&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow the installation instructions of the project. Some require a Hadoop cluster.&lt;/li&gt;
&lt;li&gt;Integrate ETL tools, databases and data engines with the data lineage tool.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>datagovernance</category>
      <category>lineage</category>
    </item>
    <item>
      <title>Open Source Data Lineage App in Python</title>
      <dc:creator>Rajat Venkatesh</dc:creator>
      <pubDate>Tue, 31 Mar 2020 09:52:54 +0000</pubDate>
      <link>https://dev.to/vrajat/open-source-data-lineage-app-in-python-1nf5</link>
      <guid>https://dev.to/vrajat/open-source-data-lineage-app-in-python-1nf5</guid>
      <description>&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;I want to show an open source Python project &lt;a href="https://github.com/tokern/data-lineage"&gt;&lt;em&gt;data-lineage&lt;/em&gt;&lt;/a&gt; to visualize and analyze data lineage. The project was developed in collaboration with data teams on data governance initiatives over the last couple of years. &lt;/p&gt;

&lt;p&gt;There are a lot of open source and commercial tools to capture data lineage. However, data engineers express two main problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The projects require a lot of effort to set up and maintain.&lt;/li&gt;
&lt;li&gt;They require constant discipline in capturing and sending all the metadata.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both these factors result in incomplete projects and lost opportunities to improve performance, ROI and data quality. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;data-lineage&lt;/em&gt; solves these problems with the following goals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;providing fast access to data lineage&lt;/li&gt;
&lt;li&gt;simple setup &lt;/li&gt;
&lt;li&gt;analysis of the lineage using a graph library&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To achieve these goals, &lt;em&gt;data-lineage&lt;/em&gt; has the following features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generate data lineage from query history.&lt;/strong&gt; Most databases maintain query history for a few days. Therefore, the setup costs of an infrastructure to capture and store metadata are minimal. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the networkx graph library to create a DAG of the lineage.&lt;/strong&gt; Networkx graphs provide programmatic access to data lineage, opening up rich opportunities for analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Plotly to visualize the graph with tooltips and other rich annotations.&lt;/strong&gt; Plotly supports rich graphs with tooltips, color coding and weights based on different attributes of the graph.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can get a &lt;a href="https://tokern.io/docs/data-lineage/example/"&gt;data lineage graph&lt;/a&gt; in fewer than 10 lines of Python code in a Jupyter Notebook.&lt;/p&gt;

&lt;p&gt;Right now, &lt;em&gt;data-lineage&lt;/em&gt; supports Postgres. Support for more databases is planned. &lt;/p&gt;

&lt;p&gt;Please give it a try if you need data lineage for your work. I appreciate any feedback.&lt;/p&gt;

&lt;p&gt;Links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/tokern/data-lineage"&gt;Github&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tokern.io/data-lineage/"&gt;Landing Page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tokern.io/docs/data-lineage/"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tokern.io/blog/data-lineage-on-redshift/"&gt;Blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>opensource</category>
      <category>postgres</category>
      <category>python</category>
    </item>
    <item>
      <title>Data Lineage On Redshift</title>
      <dc:creator>Rajat Venkatesh</dc:creator>
      <pubDate>Wed, 11 Mar 2020 08:52:10 +0000</pubDate>
      <link>https://dev.to/vrajat/data-lineage-on-redshift-39pk</link>
      <guid>https://dev.to/vrajat/data-lineage-on-redshift-39pk</guid>
<description>&lt;p&gt;Data Lineage is important for data governance and security. In data warehouses and data lakes, a team of data engineers maintains a canonical set of base tables. The source data of these base tables&lt;br&gt;
may be events from the product, logs or third-party data from SalesForce, Google Analytics or Segment.&lt;/p&gt;

&lt;p&gt;Data analysts and scientists use the base tables for their reports and machine learning algorithms. They may create derived tables to help with their work. It is not uncommon for a data warehouse or lake to have hundreds or thousands of tables. In such a scenario it is important to use automation and visual tools to track data lineage.&lt;/p&gt;

&lt;p&gt;This post describes automated visualization of data lineage in AWS Redshift from query logs of the data warehouse. The techniques are applicable to other technologies as well.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--566lAguM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/tokern"&gt;
        tokern
      &lt;/a&gt; / &lt;a href="https://github.com/tokern/data-lineage"&gt;
        data-lineage
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Generate and Visualize Data Lineage from query history
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
Tokern Lineage Engine&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://circleci.com/gh/tokern/data-lineage" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/6d2ecf6120b223b0e12715795b63a14dfee24ac0b80cedbbaf4f5dacf27873f5/68747470733a2f2f636972636c6563692e636f6d2f67682f746f6b65726e2f646174612d6c696e656167652e7376673f7374796c653d737667" alt="CircleCI"&gt;&lt;/a&gt;
&lt;a href="https://codecov.io/gh/tokern/data-lineage" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/f2140bc3d2f5f2083905b89fa1206ce0e949bf52543a80a3480916afba22e738/68747470733a2f2f636f6465636f762e696f2f67682f746f6b65726e2f646174612d6c696e656167652f6272616e63682f6d61737465722f67726170682f62616467652e737667" alt="codecov"&gt;&lt;/a&gt;
&lt;a href="https://pypi.python.org/pypi/data-lineage" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/4d790bcfbc4eb498710596d44bc540ce8e02130348fa493b5974f930adb3c106/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f646174612d6c696e656167652e737667" alt="PyPI"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/data-lineage/" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/4253349f1fe7335f231f2b938bb05d3760b73c7ec09b1a442f176f48a14a46b4/68747470733a2f2f696d672e736869656c64732e696f2f707970692f6c2f646174612d6c696e656167652e737667" alt="image"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/data-lineage/" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/a06be778b5d351d94935a3d5638ec242e318a85794457b183e5b08a247e02000/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f646174612d6c696e656167652e737667" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Tokern Lineage Engine is a &lt;em&gt;fast&lt;/em&gt; and &lt;em&gt;easy to use&lt;/em&gt; application to collect, visualize and analyze
column-level data lineage in databases, data warehouses and data lakes in AWS and GCP.&lt;/p&gt;
&lt;p&gt;Tokern Lineage helps you browse column-level data lineage&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;visually using &lt;a href="https://github.com/quantumblacklabs/kedro-viz"&gt;kedro-viz&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;analyze lineage graphs programmatically using the powerful &lt;a href="https://networkx.org/" rel="nofollow"&gt;networkx graph library&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Demo of Tokern Lineage App&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://user-images.githubusercontent.com/1638298/118261607-688a7100-b4d1-11eb-923a-5d2407d6bd8d.gif"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wTkgHsQC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/1638298/118261607-688a7100-b4d1-11eb-923a-5d2407d6bd8d.gif" alt="data-lineage"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Checkout an &lt;a href="http://tokern.io/docs/data-lineage/example/" rel="nofollow"&gt;example data lineage notebook&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Check out &lt;a href="https://tokern.io/blog/data-lineage-on-redshift/" rel="nofollow"&gt;the post on using data lineage for cost control&lt;/a&gt; for an
example of how data lineage can be used in production.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
Quick Start&lt;/h2&gt;
&lt;h3&gt;
Install a demo using Docker and Docker Compose&lt;/h3&gt;
&lt;p&gt;Download the docker-compose file from Github repository.&lt;/p&gt;
&lt;div class="snippet-clipboard-content position-relative overflow-auto"&gt;&lt;pre&gt;&lt;code&gt;# in a new directory run
wget https://raw.githubusercontent.com/tokern/data-lineage/master/install-manifests/docker-compose/catalog-demo.yml
# or run
curl https://raw.githubusercontent.com/tokern/data-lineage/master/install-manifests/docker-compose/tokern-lineage-engine.yml -o docker-compose.yml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Run docker-compose&lt;/p&gt;
&lt;div class="snippet-clipboard-content position-relative overflow-auto"&gt;&lt;pre&gt;&lt;code&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Check that the containers are running.&lt;/p&gt;
&lt;div class="snippet-clipboard-content position-relative overflow-auto"&gt;
&lt;pre&gt;&lt;code&gt;docker ps
CONTAINER ID   IMAGE                                    CREATED        STATUS       PORTS                    NAMES
3f4e77845b81   tokern/data-lineage-viz:latest   ...   4 hours ago    Up 4&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/tokern/data-lineage"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  Workload System of Record
&lt;/h2&gt;

&lt;p&gt;A system of record of all activity in databases is a prerequisite for any type of analysis. For example, AWS Redshift has many system tables and views that record all the activity in the database. Since these tables retain data for a limited time, it is important to persist the data. AWS provides scripts to store the data in tables within Redshift itself. For performance analysis the query log stored in STL_QUERY and STL_QUERYTEXT are the most important.&lt;/p&gt;

&lt;p&gt;Tokern reads and processes the records in STL_QUERY &amp;amp; STL_QUERYTEXT at regular intervals. It adds the following information for every query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Type of query, such as SELECT, DML, DDL, COPY or UNLOAD.&lt;/li&gt;
&lt;li&gt;Source tables &amp;amp; columns which are read by the query if applicable.&lt;/li&gt;
&lt;li&gt;Source files in S3 for COPY queries.&lt;/li&gt;
&lt;li&gt;Target table &amp;amp; columns where the data was loaded if applicable.&lt;/li&gt;
&lt;li&gt;Target files in S3 for UNLOAD queries.&lt;/li&gt;
&lt;/ul&gt;
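&lt;p&gt;Classifying the query type is the simplest of these steps. A hedged sketch based on the leading keyword only (this is not Tokern's actual implementation):&lt;/p&gt;

```python
def classify(sql):
    """Rough query classification by leading keyword (a sketch only)."""
    words = sql.split()
    head = words[0].upper() if words else ""
    if head == "COPY":
        return "COPY"
    if head == "UNLOAD":
        return "UNLOAD"
    if head in ("INSERT", "UPDATE", "DELETE"):
        return "DML"
    if head in ("CREATE", "ALTER", "DROP"):
        return "DDL"
    if head in ("SELECT", "WITH"):
        return "SELECT"
    return "OTHER"
```

&lt;p&gt;Extracting the source and target tables requires an actual SQL parser, since they can appear in subqueries and CTEs.&lt;/p&gt;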

&lt;h2&gt;
  
  
  Data Lineage
&lt;/h2&gt;

&lt;p&gt;Tokern uses the system of record to build a network graph for every table &amp;amp; pipeline. An example for &lt;em&gt;infallible_galois&lt;/em&gt; is visualized below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0Syg3Sx7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kh1ltwaf5u79qqq20435.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0Syg3Sx7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kh1ltwaf5u79qqq20435.png" alt="infallible_galois" width="700" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the network graph, data moves from left to right. Every node (or circle) represents a table. There is an edge from one node to another if the data load of the second table reads from the first. A table can be loaded with data from many tables. For example, the data load for hopeful_matsumoto reads data from hungry_margulis.&lt;br&gt;
The graph can be analyzed programmatically or used to create interactive charts that help data engineers glean actionable information.&lt;/p&gt;
&lt;h2&gt;
  
  
  Reduce copies of sensitive data
&lt;/h2&gt;

&lt;p&gt;The security engineers at Company S used a scanning tool to find all tables and columns that stored sensitive data like&lt;br&gt;
PII, financial and business sensitive data. They found that there were many more copies than anticipated. An immediate goal&lt;br&gt;
was to reduce the number of copies to reduce the security vulnerability surface area. To achieve the goal they had to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify the owners of the copy.&lt;/li&gt;
&lt;li&gt;Understand their motivation to create copies.&lt;/li&gt;
&lt;li&gt;Work with owners to eliminate copies or use masked copies of the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The challenges to achieving these goals were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Determine the owners of the data.&lt;/li&gt;
&lt;li&gt;Number of conversations required to understand the need for copies and agree on a workaround.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Due to the sheer number of tables, the task was daunting and required an automation tool.&lt;/p&gt;
&lt;h2&gt;
  
  
  Table specific Network Graphs
&lt;/h2&gt;

&lt;p&gt;Tokern provided the security team with actionable information to discover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Owners of copies of data&lt;/li&gt;
&lt;li&gt;A profile consisting of the queries used to create the table and read it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The profile helped the security engineers prepare for the conversation with the owners. On a number of occasions,&lt;br&gt;
duplicate tables were eliminated. In other cases, the workflow was changed to eliminate temporary tables for&lt;br&gt;
intermediate data.&lt;/p&gt;

&lt;p&gt;In the example below for &lt;em&gt;upbeat_ellis&lt;/em&gt;, two intermediate tables with sensitive data were eliminated from the workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a-kFTepY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tvamvnh45jqsz3xjkzra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a-kFTepY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tvamvnh45jqsz3xjkzra.png" alt="upbeat_ellis" width="700" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Large Network graphs
&lt;/h2&gt;

&lt;p&gt;Another example of a network graph, for &lt;em&gt;crazy_tereshkova&lt;/em&gt;, is shown below.&lt;br&gt;
As seen, the data path is much more complicated, with close to a hundred tasks required to eventually load data into the table. To deal with such complicated data pipelines, the graph is interactive: it can be panned &amp;amp; zoomed to focus on specific parts, and metadata of tasks such as start time, run time and table names is shown by hovering over the nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HL0LPoO6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/evflq0107so5ni8msfxo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HL0LPoO6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/evflq0107so5ni8msfxo.png" alt="tereshkova" width="700" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Security Engineers need data lineage and related automation tools to manage database security. All databases provide a workload system of record. &lt;a href="https://tokern.io/lineage/"&gt;Data Lineage tools&lt;/a&gt; can use this information to visualize data lineage as well as use rules to automate checks. &lt;/p&gt;

&lt;p&gt;If open source data lineage tools are interesting to you, check out the &lt;a href="https://github.com/tokern/data-lineage"&gt;data-lineage&lt;/a&gt; GitHub project.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--566lAguM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/tokern"&gt;
        tokern
      &lt;/a&gt; / &lt;a href="https://github.com/tokern/data-lineage"&gt;
        data-lineage
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Generate and Visualize Data Lineage from query history
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
Tokern Lineage Engine&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://circleci.com/gh/tokern/data-lineage" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/6d2ecf6120b223b0e12715795b63a14dfee24ac0b80cedbbaf4f5dacf27873f5/68747470733a2f2f636972636c6563692e636f6d2f67682f746f6b65726e2f646174612d6c696e656167652e7376673f7374796c653d737667" alt="CircleCI"&gt;&lt;/a&gt;
&lt;a href="https://codecov.io/gh/tokern/data-lineage" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/f2140bc3d2f5f2083905b89fa1206ce0e949bf52543a80a3480916afba22e738/68747470733a2f2f636f6465636f762e696f2f67682f746f6b65726e2f646174612d6c696e656167652f6272616e63682f6d61737465722f67726170682f62616467652e737667" alt="codecov"&gt;&lt;/a&gt;
&lt;a href="https://pypi.python.org/pypi/data-lineage" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/4d790bcfbc4eb498710596d44bc540ce8e02130348fa493b5974f930adb3c106/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f646174612d6c696e656167652e737667" alt="PyPI"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/data-lineage/" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/4253349f1fe7335f231f2b938bb05d3760b73c7ec09b1a442f176f48a14a46b4/68747470733a2f2f696d672e736869656c64732e696f2f707970692f6c2f646174612d6c696e656167652e737667" alt="image"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/data-lineage/" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/a06be778b5d351d94935a3d5638ec242e318a85794457b183e5b08a247e02000/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f646174612d6c696e656167652e737667" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Tokern Lineage Engine is &lt;em&gt;fast&lt;/em&gt; and &lt;em&gt;easy to use&lt;/em&gt; application to collect, visualize and analyze
column-level data lineage in databases, data warehouses and data lakes in AWS and GCP.&lt;/p&gt;
&lt;p&gt;Tokern Lineage helps you browse column-level data lineage&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;visually using &lt;a href="https://github.com/quantumblacklabs/kedro-viz"&gt;kedro-viz&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;analyze lineage graphs programmatically using the powerful &lt;a href="https://networkx.org/" rel="nofollow"&gt;networkx graph library&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Demo of Tokern Lineage App&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://user-images.githubusercontent.com/1638298/118261607-688a7100-b4d1-11eb-923a-5d2407d6bd8d.gif"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wTkgHsQC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://user-images.githubusercontent.com/1638298/118261607-688a7100-b4d1-11eb-923a-5d2407d6bd8d.gif" alt="data-lineage"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Checkout an &lt;a href="http://tokern.io/docs/data-lineage/example/" rel="nofollow"&gt;example data lineage notebook&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Check out &lt;a href="https://tokern.io/blog/data-lineage-on-redshift/" rel="nofollow"&gt;the post on using data lineage for cost control&lt;/a&gt; for an
example of how data lineage can be used in production.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
Quick Start&lt;/h2&gt;
&lt;h3&gt;
Install a demo using Docker and Docker Compose&lt;/h3&gt;
&lt;p&gt;Download the docker-compose file from the GitHub repository.&lt;/p&gt;
&lt;div class="snippet-clipboard-content position-relative overflow-auto"&gt;&lt;pre&gt;&lt;code&gt;# in a new directory run
wget https://raw.githubusercontent.com/tokern/data-lineage/master/install-manifests/docker-compose/catalog-demo.yml
# or run
curl https://raw.githubusercontent.com/tokern/data-lineage/master/install-manifests/docker-compose/tokern-lineage-engine.yml -o docker-compose.yml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Run docker-compose&lt;/p&gt;
&lt;div class="snippet-clipboard-content position-relative overflow-auto"&gt;&lt;pre&gt;&lt;code&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Check that the containers are running.&lt;/p&gt;
&lt;div class="snippet-clipboard-content position-relative overflow-auto"&gt;
&lt;pre&gt;&lt;code&gt;docker ps
CONTAINER ID   IMAGE                                    CREATED        STATUS       PORTS                    NAMES
3f4e77845b81   tokern/data-lineage-viz:latest   ...   4 hours ago    Up 4&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/tokern/data-lineage"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>database</category>
      <category>lineage</category>
      <category>aws</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Secure Data Lake with AWS Lake Formation Tutorial</title>
      <dc:creator>Rajat Venkatesh</dc:creator>
      <pubDate>Mon, 27 Jan 2020 11:03:41 +0000</pubDate>
      <link>https://dev.to/vrajat/secure-data-lake-with-aws-lake-formation-tutorial-d82</link>
      <guid>https://dev.to/vrajat/secure-data-lake-with-aws-lake-formation-tutorial-d82</guid>
      <description>&lt;p&gt;&lt;a href="https://aws.amazon.com/lake-formation/"&gt;AWS Lake Formation&lt;/a&gt; helps to build a secure data lake on data in AWS S3.&lt;br&gt;
This blog will help you get started by describing the steps to setup a basic data lake with S3, Glue, Lake Formation&lt;br&gt;
and Athena in AWS. The tutorial will use&lt;br&gt;
&lt;a href="https://registry.opendata.aws/nyc-tlc-trip-records-pds/"&gt;New York City Taxi and Limousine Commission (TLC) Trip Record Data&lt;/a&gt;&lt;br&gt;
as the data set.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EUVUaNwh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://tokern.io/static/085cef4d34b2ada62c0bf1015cf6d48a/ec5a1/aws_lake_formation.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EUVUaNwh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://tokern.io/static/085cef4d34b2ada62c0bf1015cf6d48a/ec5a1/aws_lake_formation.webp" alt="Lake Formation Diagram" width="830" height="622"&gt;&lt;/a&gt;&lt;br&gt;Image by &lt;a href="https://www.awsgeek.com/posts/AWS-Lake-Formation/index.html"&gt;Jerry Hargrove&lt;/a&gt;
  &lt;/p&gt;

&lt;p&gt;At the end of the tutorial, you will have a data lake set up with Lake Formation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Setup AWS Lake Formation using instructions in &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/getting-started-setup.html"&gt;Getting Started Tutorial&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;After the tutorial, the following users and roles are set up:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Administrator&lt;/strong&gt; is the Administrator IAM User&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NycWorkflowRole&lt;/strong&gt; is the IAM Role for Workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;lakeadmin&lt;/strong&gt; is the Data Lake Administrator&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Upgrade to the Lake Formation permissions model as described in the &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/upgrade-glue-lake-formation.html"&gt;Upgrade Tutorial&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Setup AWS Athena&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Import Taxi Data into Lake Formation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Allow NycWorkflowRole to read NYC Data
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to IAM &amp;gt; Policies &amp;gt; Create Policy&lt;/li&gt;
&lt;li&gt;Follow the wizard and create a policy with name &lt;strong&gt;S3NycBucketRead&lt;/strong&gt; with the following definition:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    {
      "Version": "2012-10-17",
      "Statement": [
                {
                        "Sid": "VisualEditor0",
                        "Effect": "Allow",
                        "Action": [
                        "s3:GetObject",
                                "s3:ListBucket"
                        ],
                        "Resource": [
                                "arn:aws:s3:::nyc-tlc",
                                "arn:aws:s3:::nyc-tlc/*"
                        ]
                }
      ]
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Attach &lt;em&gt;S3NycBucketRead&lt;/em&gt; policy to &lt;em&gt;NycWorkFlowRole&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;
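&lt;p&gt;The console steps above can also be scripted. Here is a minimal boto3 sketch (not part of the original tutorial): the policy document mirrors &lt;em&gt;S3NycBucketRead&lt;/em&gt; exactly, and actually running the function requires AWS credentials with IAM permissions.&lt;/p&gt;

```python
import json

# The S3NycBucketRead policy from the tutorial, as a Python dict.
NYC_READ_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::nyc-tlc", "arn:aws:s3:::nyc-tlc/*"],
        }
    ],
}


def create_and_attach(role_name: str = "NycWorkflowRole") -> None:
    """Create the policy and attach it to the workflow role."""
    import boto3  # needs AWS credentials when actually called

    iam = boto3.client("iam")
    resp = iam.create_policy(
        PolicyName="S3NycBucketRead",
        PolicyDocument=json.dumps(NYC_READ_POLICY),
    )
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn=resp["Policy"]["Arn"],
    )
```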

&lt;h3&gt;
  
  
  Create a database and grant permissions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Head to AWS Lake Formation &amp;gt; Databases &amp;gt; Create Database&lt;/li&gt;
&lt;li&gt;Fill the form with details below and click on &lt;strong&gt;Create Database&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name&lt;/strong&gt;: taxidata&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Click on Databases in left navigation pane&lt;/li&gt;
&lt;li&gt;Select &lt;em&gt;taxidata&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Click on Actions &amp;gt; Grant&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fill up the form with following details and then click on &lt;strong&gt;Grant&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IAM Users and Roles&lt;/strong&gt;: NycWorkflowRole&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Permissions&lt;/strong&gt;: Create Table, Alter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KWQisk57--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://tokern.io/static/24a2e138604c8e844281339b1beb84cf/ec5a1/grant-permissions.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KWQisk57--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://tokern.io/static/24a2e138604c8e844281339b1beb84cf/ec5a1/grant-permissions.webp" alt="" width="830" height="368"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
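&lt;p&gt;These console actions map to the Glue and Lake Formation APIs. A boto3 sketch of the same database creation and grant — &lt;code&gt;role_arn&lt;/code&gt; is a placeholder for NycWorkflowRole's ARN, and the calls need AWS credentials:&lt;/p&gt;

```python
DATABASE_NAME = "taxidata"
DB_PERMISSIONS = ["CREATE_TABLE", "ALTER"]


def create_database_and_grant(role_arn: str) -> None:
    """Create the taxidata database and grant the workflow role
    CREATE_TABLE and ALTER on it, mirroring the console steps."""
    import boto3  # needs AWS credentials when actually called

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": DATABASE_NAME})

    lakeformation = boto3.client("lakeformation")
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": role_arn},
        Resource={"Database": {"Name": DATABASE_NAME}},
        Permissions=DB_PERMISSIONS,
    )
```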

&lt;h3&gt;
  
  
  Create a Crawler to register the data in Glue data catalog
&lt;/h3&gt;

&lt;p&gt;A Glue Crawler will read the files in the nyc-tlc bucket and automatically create tables in a database.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Head to AWS Glue &amp;gt; Crawlers &amp;gt; &lt;em&gt;Add Crawler&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Fill in the following details in the wizard and click &lt;em&gt;Finish&lt;/em&gt; at the end.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Crawler Name&lt;/td&gt;
&lt;td&gt;TaxiCrawler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tags&lt;/td&gt;
&lt;td&gt;&lt;em&gt;project: lake_formation_example&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crawler Source Type&lt;/td&gt;
&lt;td&gt;Data Stores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Choose a data Store&lt;/td&gt;
&lt;td&gt;S3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crawl data in&lt;/td&gt;
&lt;td&gt;Specified Path in another account&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Include Path&lt;/td&gt;
&lt;td&gt;s3://nyc-tlc/trip data/&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exclude Patterns&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;fhv_tripdata_2015*&lt;/li&gt;
&lt;li&gt;fhv_tripdata_2016*&lt;/li&gt;
&lt;li&gt;fhv_tripdata_2017*&lt;/li&gt;
&lt;li&gt;fhv_tripdata_2018*&lt;/li&gt;
&lt;li&gt;yellow*&lt;/li&gt;
&lt;li&gt;green*&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add Another Data Store&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Choose a data Store&lt;/td&gt;
&lt;td&gt;S3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crawl data in&lt;/td&gt;
&lt;td&gt;Specified Path in another account&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Include Path&lt;/td&gt;
&lt;td&gt;s3://nyc-tlc/misc/&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exclude Patterns&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;*foil*&lt;/li&gt;
&lt;li&gt;shared*&lt;/li&gt;
&lt;li&gt;uber*&lt;/li&gt;
&lt;li&gt;*.html&lt;/li&gt;
&lt;li&gt;*.zip&lt;/li&gt;
&lt;li&gt;FOIL_*&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add another Data Store&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM Role&lt;/td&gt;
&lt;td&gt;Choose an Existing Role&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM Role&lt;/td&gt;
&lt;td&gt;NycWorkflowRole&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frequency&lt;/td&gt;
&lt;td&gt;Run On Demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Choose an existing database&lt;/td&gt;
&lt;td&gt;TaxiData&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefix added to tables&lt;/td&gt;
&lt;td&gt;csv_&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
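&lt;p&gt;The same crawler can be created programmatically through the Glue API. A boto3 sketch of the wizard settings in the table above — &lt;code&gt;role_arn&lt;/code&gt; is a placeholder for NycWorkflowRole's ARN, and the call needs AWS credentials:&lt;/p&gt;

```python
# The two data stores and their exclusion patterns from the wizard table.
TRIP_DATA_TARGET = {
    "Path": "s3://nyc-tlc/trip data/",
    "Exclusions": [
        "fhv_tripdata_2015*", "fhv_tripdata_2016*",
        "fhv_tripdata_2017*", "fhv_tripdata_2018*",
        "yellow*", "green*",
    ],
}
MISC_TARGET = {
    "Path": "s3://nyc-tlc/misc/",
    "Exclusions": ["*foil*", "shared*", "uber*", "*.html", "*.zip", "FOIL_*"],
}


def create_taxi_crawler(role_arn: str) -> None:
    """Create TaxiCrawler with the settings from the wizard table."""
    import boto3  # needs AWS credentials when actually called

    glue = boto3.client("glue")
    glue.create_crawler(
        Name="TaxiCrawler",
        Role=role_arn,
        DatabaseName="taxidata",
        TablePrefix="csv_",
        Targets={"S3Targets": [TRIP_DATA_TARGET, MISC_TARGET]},
        Tags={"project": "lake_formation_example"},
    )
```

Because the schedule is omitted, the crawler runs on demand — the equivalent of the wizard's "Run On Demand" frequency.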

&lt;ol&gt;
&lt;li&gt;In the crawlers list, select &lt;em&gt;TaxiCrawler&lt;/em&gt; and click the &lt;em&gt;Run Crawler&lt;/em&gt; button. The crawl should take less than a minute and will register the CSV files in the taxi S3 bucket that were not excluded.&lt;/li&gt;
&lt;li&gt;Go to
&lt;em&gt;Database -&amp;gt; Tables&lt;/em&gt; and review the tables.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DMk1sEt9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://tokern.io/static/ef6fc204abc8d87cae7e241d19e468bb/ec5a1/tables.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DMk1sEt9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://tokern.io/static/ef6fc204abc8d87cae7e241d19e468bb/ec5a1/tables.webp" alt="" width="830" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Verify Data in AWS Athena
&lt;/h2&gt;

&lt;p&gt;Fire up &lt;a href="https://github.com/dbcli/athenacli"&gt;athenacli&lt;/a&gt; as &lt;em&gt;lakeadmin&lt;/em&gt; and run a couple of queries to verify the data looks good.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;us-east-2:default&amp;gt; select * from taxidata.csv_trip_data limit 10;&lt;br&gt;
+----------------------+---------------------+------------------+--------------+--------------+---------+-------------------+&lt;br&gt;
| dispatching_base_num | pickup_datetime     | dropoff_datetime | pulocationid | dolocationid | sr_flag | hvfhs_license_num |&lt;br&gt;
+----------------------+---------------------+------------------+--------------+--------------+---------+-------------------+&lt;br&gt;
| B00013               | 2015-01-01 00:30:00 |                  | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;  | &amp;lt;null&amp;gt;            |&lt;br&gt;
| B00013               | 2015-01-01 01:22:00 |                  | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;  | &amp;lt;null&amp;gt;            |&lt;br&gt;
| B00013               | 2015-01-01 01:23:00 |                  | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;  | &amp;lt;null&amp;gt;            |&lt;br&gt;
| B00013               | 2015-01-01 01:44:00 |                  | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;  | &amp;lt;null&amp;gt;            |&lt;br&gt;
| B00013               | 2015-01-01 02:00:00 |                  | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;  | &amp;lt;null&amp;gt;            |&lt;br&gt;
| B00013               | 2015-01-01 02:00:00 |                  | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;  | &amp;lt;null&amp;gt;            |&lt;br&gt;
| B00013               | 2015-01-01 02:00:00 |                  | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;  | &amp;lt;null&amp;gt;            |&lt;br&gt;
| B00013               | 2015-01-01 02:50:00 |                  | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;  | &amp;lt;null&amp;gt;            |&lt;br&gt;
| B00013               | 2015-01-01 04:45:00 |                  | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;  | &amp;lt;null&amp;gt;            |&lt;br&gt;
| B00013               | 2015-01-01 06:30:00 |                  | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;  | &amp;lt;null&amp;gt;            |&lt;br&gt;
+----------------------+---------------------+------------------+--------------+--------------+---------+-------------------+&lt;br&gt;
10 rows in set&lt;br&gt;
Time: 4.926s&lt;br&gt;
us-east-2:default&amp;gt; select * from taxidata.csv_misc limit 10;&lt;br&gt;
+------------+-----------------+---------------------------+---------------+&lt;br&gt;
| locationid | borough         | zone                      | service_zone  |&lt;br&gt;
+------------+-----------------+---------------------------+---------------+&lt;br&gt;
| 1          | "EWR"           | "Newark Airport"          | "EWR"         |&lt;br&gt;
| 2          | "Queens"        | "Jamaica Bay"             | "Boro Zone"   |&lt;br&gt;
| 3          | "Bronx"         | "Allerton/Pelham Gardens" | "Boro Zone"   |&lt;br&gt;
| 4          | "Manhattan"     | "Alphabet City"           | "Yellow Zone" |&lt;br&gt;
| 5          | "Staten Island" | "Arden Heights"           | "Boro Zone"   |&lt;br&gt;
| 6          | "Staten Island" | "Arrochar/Fort Wadsworth" | "Boro Zone"   |&lt;br&gt;
| 7          | "Queens"        | "Astoria"                 | "Boro Zone"   |&lt;br&gt;
| 8          | "Queens"        | "Astoria Park"            | "Boro Zone"   |&lt;br&gt;
| 9          | "Queens"        | "Auburndale"              | "Boro Zone"   |&lt;br&gt;
| 10         | "Queens"        | "Baisley Park"            | "Boro Zone"   |&lt;br&gt;
+------------+-----------------+---------------------------+---------------+&lt;br&gt;
10 rows in set&lt;br&gt;
Time: 3.116s&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This tutorial showed how to set up a basic data lake containing NYC Taxi Trip data using AWS Lake Formation.&lt;br&gt;
Now the data lake is ready to provide secure access to data engineers, analysts and data scientists.&lt;br&gt;
We would love to hear if this tutorial was helpful to you. Get in touch using the chat widget.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>governance</category>
      <category>security</category>
    </item>
  </channel>
</rss>
