<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shruti Nakum</title>
    <description>The latest articles on DEV Community by Shruti Nakum (@shruti_nakum).</description>
    <link>https://dev.to/shruti_nakum</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3564334%2F7dea189a-f32d-4f09-b044-fab0fca5b7e6.png</url>
      <title>DEV Community: Shruti Nakum</title>
      <link>https://dev.to/shruti_nakum</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shruti_nakum"/>
    <language>en</language>
    <item>
      <title>How does Snowflake’s data sharing work, and what are its main advantages?</title>
      <dc:creator>Shruti Nakum</dc:creator>
      <pubDate>Thu, 29 Jan 2026 12:53:24 +0000</pubDate>
      <link>https://dev.to/shruti_nakum/how-does-snowflakes-data-sharing-work-and-what-are-its-main-advantages-560i</link>
      <guid>https://dev.to/shruti_nakum/how-does-snowflakes-data-sharing-work-and-what-are-its-main-advantages-560i</guid>
      <description>&lt;p&gt;Snowflake’s data sharing is a feature that allows you to share data across Snowflake accounts without copying or physically moving the data. Instead of exporting files or building separate pipelines, Snowflake uses secure metadata pointers to let other accounts access selected databases, schemas, or tables in real time.&lt;/p&gt;

&lt;p&gt;From a technical point of view, data sharing works through secure shares. As a Snowflake developer, I define a share, add the required objects to it, and then grant access to a consumer account. The consumer sees the shared data as a read-only database, which means the original data stays protected while still being fully queryable.&lt;/p&gt;
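
&lt;p&gt;Sketched as provider-side and consumer-side SQL, collected here as plain strings (all object, share, and account names are hypothetical placeholders):&lt;/p&gt;

```python
# Provider-side SQL for a Snowflake secure share, sketched as plain strings.
# The object names (sales_db, public.orders) and the account identifiers
# (partner_org.partner_account, provider_org.provider_account) are made up.
provider_statements = [
    "CREATE SHARE sales_share;",
    "GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;",
    "GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;",
    "GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;",
    "ALTER SHARE sales_share ADD ACCOUNTS = partner_org.partner_account;",
]

# Consumer-side: the share surfaces as a read-only database.
consumer_statements = [
    "CREATE DATABASE shared_sales FROM SHARE provider_org.provider_account.sales_share;",
    "SELECT COUNT(*) FROM shared_sales.public.orders;",
]

for stmt in provider_statements + consumer_statements:
    print(stmt)
```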

&lt;p&gt;One major advantage is consistency. Since both the provider and consumer query the same underlying data, there’s no duplication and no sync issues. Costs are also cleanly separated: the data provider pays for storage, while the consumer pays for the compute used to run their queries.&lt;/p&gt;

&lt;p&gt;As a &lt;a href="https://www.bacancytechnology.com/hire-snowflake-developer" rel="noopener noreferrer"&gt;Snowflake developer&lt;/a&gt;, I personally started using this feature during a project where multiple teams needed the same reporting data, but copying it was causing delays and mismatched numbers. Once we switched to Snowflake data sharing, those issues disappeared because everyone was looking at the same source of truth.&lt;/p&gt;

&lt;p&gt;This feature becomes especially valuable during Snowflake migration projects. Instead of migrating everything in one risky move, teams can share production data with a new Snowflake environment, validate transformations, compare dashboards, and gradually shift workloads. I’ve used this approach to run parallel systems until the business was confident enough to complete the cutover.&lt;/p&gt;

&lt;p&gt;In short, Snowflake’s data sharing combines security, performance, and simplicity. It removes the need for complex data transfers and plays a big role in modern analytics and migration strategies.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>snowflake</category>
      <category>data</category>
    </item>
    <item>
      <title>What are the different types of data sources you can connect to in Power BI?</title>
      <dc:creator>Shruti Nakum</dc:creator>
      <pubDate>Fri, 16 Jan 2026 11:57:14 +0000</pubDate>
      <link>https://dev.to/shruti_nakum/what-are-the-different-types-of-data-sources-you-can-connect-to-in-power-bi-2hmk</link>
      <guid>https://dev.to/shruti_nakum/what-are-the-different-types-of-data-sources-you-can-connect-to-in-power-bi-2hmk</guid>
      <description>&lt;p&gt;Power BI can connect to many different types of data sources, which is one of the reasons it’s so flexible. As &lt;a href="https://www.bacancytechnology.com/hire-power-bi-developers" rel="noopener noreferrer"&gt;Power BI developers&lt;/a&gt;, we usually group these sources into a few broad categories based on where the data lives.&lt;/p&gt;

&lt;p&gt;First, there are file-based sources. This includes Excel files, CSVs, text files, XML, JSON, and PDFs. These are commonly used when data comes from spreadsheets, exports, or shared folders.&lt;/p&gt;

&lt;p&gt;Next are database sources. Power BI connects easily to relational databases like SQL Server, MySQL, PostgreSQL, and Oracle, as well as cloud databases and data warehouses such as Azure SQL Database, Snowflake, and Amazon Redshift. These are typically used when working with structured, high-volume data.&lt;/p&gt;

&lt;p&gt;Power BI also supports cloud and online services. You can connect to platforms like Azure Data Lake, SharePoint, OneDrive, Google Analytics, Salesforce, Dynamics 365, and other SaaS applications. This is very useful when data is spread across different business tools.&lt;/p&gt;

&lt;p&gt;Another important category is big data and analytics platforms. Power BI can connect to tools like Azure Synapse, Databricks, Hadoop, and Spark, which are designed to handle very large datasets.&lt;/p&gt;

&lt;p&gt;Finally, there are APIs and web sources. Power BI can pull data from REST APIs, web pages, and custom data feeds. This is helpful when data isn’t stored in traditional databases but is available through web services.&lt;/p&gt;

&lt;p&gt;In simple terms, Power BI developers can connect to almost any place where data exists, making it easier to bring everything together into a single, meaningful dashboard.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>datascience</category>
      <category>data</category>
    </item>
    <item>
      <title>Can you describe a complex data architecture you’ve designed or implemented in the past?</title>
      <dc:creator>Shruti Nakum</dc:creator>
      <pubDate>Tue, 30 Dec 2025 12:46:26 +0000</pubDate>
      <link>https://dev.to/shruti_nakum/can-you-describe-a-complex-data-architecture-youve-designed-or-implemented-in-the-past-14kl</link>
      <guid>https://dev.to/shruti_nakum/can-you-describe-a-complex-data-architecture-youve-designed-or-implemented-in-the-past-14kl</guid>
      <description>&lt;p&gt;In one of my projects, I designed a data architecture to handle data coming from multiple sources such as application databases, third-party APIs, event logs, and flat files. The main challenge was that the data volume was large, it arrived at different speeds, and different teams needed it for different purposes.&lt;/p&gt;

&lt;p&gt;I started by setting up a cloud-based data lake where all incoming data was stored in its raw form. This acted like a safety net — if something broke downstream, we could always go back to the original data. From there, I built automated pipelines to clean, standardize, and transform the data before loading it into a data warehouse for reporting and analytics.&lt;/p&gt;

&lt;p&gt;To make the system reliable, I added validation checks at every stage, like row counts, schema checks, and null value detection. If anything looked off, alerts were triggered so we could fix issues before they affected dashboards or reports. I also separated the architecture into layers (raw, processed, and curated) so data stayed organized and easy to manage.&lt;/p&gt;
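
&lt;p&gt;Those checks can be sketched in a few lines of pandas; the column names and the 5% null threshold below are hypothetical:&lt;/p&gt;

```python
import pandas as pd

# Minimal sketch of the stage-level validation checks described above:
# row counts, schema checks, and null detection.
EXPECTED_DTYPES = {"order_id": "int64", "amount": "float64", "region": "object"}
NULL_THRESHOLD = 0.05

def validate_batch(df):
    """Return a list of issues; an empty list means the batch passes."""
    issues = []
    if df.empty:
        issues.append("row count check failed: batch is empty")
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns:
            issues.append(f"schema check failed: missing column {col}")
        elif str(df[col].dtype) != dtype:
            issues.append(f"schema check failed: {col} is {df[col].dtype}, expected {dtype}")
    for col in df.columns:
        null_ratio = df[col].isna().mean()
        if null_ratio > NULL_THRESHOLD:
            issues.append(f"null check failed: {col} is {null_ratio:.0%} null")
    return issues

batch = pd.DataFrame({"order_id": [1, 2, 3],
                      "amount": [9.5, None, 12.0],
                      "region": ["EU", "US", "EU"]})
print(validate_batch(batch))  # the null check fires for "amount"
```

&lt;p&gt;In a real pipeline, a non-empty issue list would trigger the alerting described above instead of just being printed.&lt;/p&gt;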

&lt;p&gt;Performance and cost were also important. I optimized pipelines to run only when needed, used partitioning for large tables, and scaled compute resources based on workload. This helped keep processing fast without wasting money.&lt;/p&gt;

&lt;p&gt;The final setup allowed business users to access clean dashboards, while data scientists could work with trusted, well-structured data for modeling and analysis. As a &lt;a href="https://www.bacancytechnology.com/hire-data-engineer" rel="noopener noreferrer"&gt;data engineer&lt;/a&gt;, my focus was on building a system that could grow with the business, stay stable over time, and make data easy for everyone to use.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How can you incorporate external knowledge into an LLM?</title>
      <dc:creator>Shruti Nakum</dc:creator>
      <pubDate>Fri, 05 Dec 2025 06:29:49 +0000</pubDate>
      <link>https://dev.to/shruti_nakum/how-can-you-incorporate-external-knowledge-into-an-llm-221n</link>
      <guid>https://dev.to/shruti_nakum/how-can-you-incorporate-external-knowledge-into-an-llm-221n</guid>
      <description>&lt;p&gt;You can incorporate external knowledge into an LLM by giving it access to information that isn’t part of its original training. The most common way to do this is retrieval-augmented generation (RAG). In simple terms, instead of the model guessing from memory, you let it search your documents, database, or API first, then use that information to answer. It’s like giving the model a quick reference guide before it responds.&lt;/p&gt;

&lt;p&gt;Another way is fine-tuning, where you train the model on your own examples so it learns your specific style, rules, or domain knowledge. This is useful when you want the model to follow a certain pattern or understand very niche topics, and it’s something that comes up a lot in &lt;a href="https://www.bacancytechnology.com/large-language-model-development" rel="noopener noreferrer"&gt;LLM development&lt;/a&gt; work.&lt;/p&gt;

&lt;p&gt;You can also use tools or plug-ins, where the model calls external systems, for example, asking a calculator for math, a search API for live data, or a knowledge base for facts. The model doesn’t store the info itself; it just knows how to fetch it.&lt;/p&gt;

&lt;p&gt;Overall, the idea is simple: instead of relying only on what the LLM already knows, you connect it to the right sources so it can pull in accurate, updated, or specialized information whenever needed.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>rag</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Suppose there is a dataset having variables with missing values of more than 30%, how will you deal with such a dataset?</title>
      <dc:creator>Shruti Nakum</dc:creator>
      <pubDate>Fri, 28 Nov 2025 13:59:25 +0000</pubDate>
      <link>https://dev.to/shruti_nakum/suppose-there-is-a-dataset-having-variables-with-missing-values-of-more-than-30-how-will-you-deal-476j</link>
      <guid>https://dev.to/shruti_nakum/suppose-there-is-a-dataset-having-variables-with-missing-values-of-more-than-30-how-will-you-deal-476j</guid>
      <description>&lt;p&gt;If a variable has more than 30% missing values, I treat it carefully because that much missing information can weaken the model. First, I try to understand the cause: is it random, system-driven, or does it follow some pattern? Knowing this helps me decide if the feature is still useful.&lt;/p&gt;

&lt;p&gt;If the column doesn’t provide strong value, the safest choice is to drop it. For important variables, I look at different imputation methods. Numeric fields might use median or interpolation, while categorical fields can use mode. If the feature is valuable but tricky, I may use advanced methods like KNN imputation or model-based imputation to estimate missing values.&lt;/p&gt;

&lt;p&gt;Sometimes missing values are meaningful on their own, for example, “not filled because user didn’t use the feature.” In those cases, I keep the column and also create a separate flag like “is_missing” to capture that information.&lt;/p&gt;
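
&lt;p&gt;A minimal pandas sketch of both tactics, with made-up columns and a synthetic dataset: drop a mostly-missing, low-value field, and impute an important one while keeping an "is_missing" flag:&lt;/p&gt;

```python
import pandas as pd

# Sketch of the two tactics above: drop a low-value column that is mostly
# missing, and keep a valuable one with median imputation plus a flag.
df = pd.DataFrame({
    "income": [52000, None, 61000, None, None, 48000],  # 50% missing, assumed low value
    "age": [34, None, 41, 29, None, 52],                # 33% missing, assumed important
    "target": [1, 0, 1, 0, 1, 0],
})

print(df.isna().mean())  # per-column missing ratios guide the decision

df = df.drop(columns=["income"])                  # low-value column: drop it
df["age_is_missing"] = df["age"].isna()           # keep the missingness signal
df["age"] = df["age"].fillna(df["age"].median())  # impute the important column
print(df)
```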

&lt;p&gt;The goal is to keep the dataset balanced, clean, and meaningful without forcing incomplete or low-quality data into the modeling process, and most &lt;a href="https://www.bacancytechnology.com/hire-data-scientist" rel="noopener noreferrer"&gt;data scientists&lt;/a&gt; prefer a cautious approach here to avoid introducing bias.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>data</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Series-1 What do you understand by Imbalanced Data?</title>
      <dc:creator>Shruti Nakum</dc:creator>
      <pubDate>Fri, 28 Nov 2025 13:57:58 +0000</pubDate>
      <link>https://dev.to/shruti_nakum/series-1-what-do-you-understand-by-imbalanced-data-3l7k</link>
      <guid>https://dev.to/shruti_nakum/series-1-what-do-you-understand-by-imbalanced-data-3l7k</guid>
      <description>&lt;p&gt;Imbalanced data means the classes in your dataset are not represented equally. One category has a lot of samples, while the other has very few. For example, imagine a medical dataset where 95% of patients are healthy and only 5% have a rare disease, that’s clearly imbalanced.&lt;/p&gt;

&lt;p&gt;The issue is that models trained on such data tend to learn the “easy pattern,” which is predicting the majority class every time. This makes the accuracy look high, but the model is actually useless for detecting the minority class, which is often the most important one.&lt;/p&gt;

&lt;p&gt;To handle this, I use techniques like oversampling the minority class (SMOTE), undersampling the majority class, using class-weighted algorithms, or choosing models that naturally handle imbalance better. I also focus more on metrics like F1-score, recall, and precision rather than plain accuracy.&lt;/p&gt;
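
&lt;p&gt;As an illustration, balanced class weights can be computed by hand with the common heuristic n_samples / (n_classes * class_count); libraries like scikit-learn then accept such weights through a class-weight option. The labels below are synthetic:&lt;/p&gt;

```python
from collections import Counter

# Balanced class weights via the common n_samples / (n_classes * count)
# heuristic, using the 95% / 5% split from the medical example above.
labels = [0] * 95 + [1] * 5
counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

weights = {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
print(weights)  # the minority class gets a much larger weight
```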

&lt;p&gt;In my experience, dealing with imbalance isn’t about making the data look flawless; it’s about guiding the model to focus on the signals that actually matter. With a bit of extra care, the model’s real-world performance improves a lot, especially when the minority class is the critical one. This is why &lt;a href="https://www.bacancytechnology.com/hire-data-scientist" rel="noopener noreferrer"&gt;data scientists&lt;/a&gt; often spend extra time tuning these scenarios instead of relying on raw accuracy.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>datascience</category>
      <category>learning</category>
    </item>
    <item>
      <title>How can you optimize loading performance when aggregating content from multiple websites into a single course module?</title>
      <dc:creator>Shruti Nakum</dc:creator>
      <pubDate>Thu, 20 Nov 2025 12:44:34 +0000</pubDate>
      <link>https://dev.to/shruti_nakum/how-can-you-optimize-loading-performance-when-aggregating-content-from-multiple-websites-into-a-c41</link>
      <guid>https://dev.to/shruti_nakum/how-can-you-optimize-loading-performance-when-aggregating-content-from-multiple-websites-into-a-c41</guid>
      <description>&lt;p&gt;When content comes from different websites, the page can slow down because it’s trying to load everything at once. To avoid that, I usually save or cache the content on our server first, so the course loads from one place instead of calling many websites every time. I also make sure images are smaller, code is clean, and only the necessary files are loaded.&lt;/p&gt;

&lt;p&gt;For sections that aren’t needed immediately, I use lazy-loading so those sections load only when someone scrolls to them. The module loads the core content first, and the rest comes in only when the user reaches it. Using asynchronous requests keeps the UI responsive, and batching multiple API responses into one payload reduces load time even more. As a &lt;a href="https://www.bacancytechnology.com/hire-web-developers" rel="noopener noreferrer"&gt;web developer&lt;/a&gt;, I find these simple steps make the module feel much faster and smoother, even if the content originally came from several different websites.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>discuss</category>
      <category>webdev</category>
      <category>web</category>
    </item>
    <item>
      <title>Series-3 How can we refresh the data in a report published to Power BI Service?</title>
      <dc:creator>Shruti Nakum</dc:creator>
      <pubDate>Wed, 05 Nov 2025 09:22:32 +0000</pubDate>
      <link>https://dev.to/shruti_nakum/series-3-how-can-we-refresh-the-data-in-a-report-published-to-power-bi-service-2el9</link>
      <guid>https://dev.to/shruti_nakum/series-3-how-can-we-refresh-the-data-in-a-report-published-to-power-bi-service-2el9</guid>
      <description>&lt;p&gt;To refresh the data in a report published to the Power BI Service, I usually make sure that the dataset has a proper connection and refresh setup. If the report uses imported data, I configure a scheduled refresh in the Power BI Service, setting the frequency (daily, hourly, etc.) and time slots based on business needs. For DirectQuery or live connections, the data refreshes automatically whenever users interact with the report, since it queries the source in real time.&lt;/p&gt;

&lt;p&gt;When a report is connected to on-premises data sources, I set up a data gateway to securely connect Power BI Service to the local databases. This allows both manual and scheduled refreshes to work without issues.&lt;/p&gt;

&lt;p&gt;I also monitor the refresh history to make sure it’s completing successfully and troubleshoot any failures, like gateway connection problems, credential issues, or query timeouts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.bacancytechnology.com/hire-power-bi-developers" rel="noopener noreferrer"&gt;Power BI developers&lt;/a&gt; like me also ensure that refresh policies are optimized and that users always see up-to-date, accurate data in their reports.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>powerplatform</category>
      <category>data</category>
    </item>
    <item>
      <title>Series-2 How do you implement row-level security in Tableau?</title>
      <dc:creator>Shruti Nakum</dc:creator>
      <pubDate>Wed, 05 Nov 2025 09:20:30 +0000</pubDate>
      <link>https://dev.to/shruti_nakum/series-2-how-do-you-implement-row-level-security-in-tableau-4bcb</link>
      <guid>https://dev.to/shruti_nakum/series-2-how-do-you-implement-row-level-security-in-tableau-4bcb</guid>
      <description>&lt;p&gt;When managing and monitoring Tableau Server performance, I focus on both system health and workbook efficiency. I regularly check the Server Status and Admin Views in Tableau to monitor CPU usage, memory consumption, background tasks, and extract refresh times. This helps me identify any bottlenecks early on.&lt;/p&gt;

&lt;p&gt;I also review workbook performance, optimizing dashboards by reducing complex calculations, minimizing filters, using extracts instead of live connections when appropriate, and limiting the number of quick filters or high-cardinality fields. A lot of performance issues come from poorly optimized workbooks, so I work closely with report creators to improve design and query logic.&lt;/p&gt;

&lt;p&gt;In addition, I schedule heavy extract refreshes during off-peak hours and make sure the data engine, VizQL server, and backgrounder processes are properly balanced based on user load. I also keep an eye on log files and use tools like Tabadmin or TSM for deeper diagnostics and fine-tuning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.bacancytechnology.com/hire-tableau-developer" rel="noopener noreferrer"&gt;Tableau developers&lt;/a&gt; like me also rely on consistent performance monitoring, optimization, and user feedback to ensure the server runs smoothly and delivers fast, reliable dashboards for everyone.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>tableau</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Series-1 How do you deal with class imbalance in a dataset when training a model?</title>
      <dc:creator>Shruti Nakum</dc:creator>
      <pubDate>Wed, 05 Nov 2025 09:18:46 +0000</pubDate>
      <link>https://dev.to/shruti_nakum/series-1-how-do-you-deal-with-class-imbalance-in-a-dataset-when-training-a-model-5hl9</link>
      <guid>https://dev.to/shruti_nakum/series-1-how-do-you-deal-with-class-imbalance-in-a-dataset-when-training-a-model-5hl9</guid>
      <description>&lt;p&gt;When I deal with class imbalance in a dataset, the first thing I do is understand how severe the imbalance is, by checking class distributions and looking at metrics like the ratio of minority to majority classes. Once I know the extent, I choose the best strategy based on the problem and data size.&lt;/p&gt;

&lt;p&gt;If the dataset is small, I often use resampling techniques, like oversampling the minority class with methods such as SMOTE or undersampling the majority class to balance the data. When the dataset is large, I prefer using class weights in algorithms like logistic regression, random forests, or XGBoost so the model gives more importance to the minority class without losing information.&lt;/p&gt;
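
&lt;p&gt;For illustration, random oversampling can be sketched with NumPy alone on synthetic data; SMOTE goes a step further by interpolating new minority points between neighbours rather than duplicating rows:&lt;/p&gt;

```python
import numpy as np

# Random oversampling of the minority class: duplicate randomly chosen
# minority rows until the classes are balanced. Data here is synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)           # 9:1 imbalance

minority_idx = np.where(y == 1)[0]
n_needed = np.sum(y == 0) - np.sum(y == 1)  # duplicates needed to balance
extra = rng.choice(minority_idx, size=n_needed, replace=True)

X_balanced = np.vstack([X, X[extra]])
y_balanced = np.concatenate([y, y[extra]])
print(np.bincount(y_balanced))  # both classes now have 90 samples
```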

&lt;p&gt;I also make sure to use evaluation metrics that work well with imbalanced data, like precision, recall, F1-score, or the AUC-ROC curve, instead of just accuracy, which can be misleading.&lt;/p&gt;

&lt;p&gt;In some cases, &lt;a href="https://www.bacancytechnology.com/hire-data-scientist" rel="noopener noreferrer"&gt;data scientists&lt;/a&gt; like me experiment with ensemble methods or anomaly detection approaches when the imbalance is extreme. The goal is always to help the model learn meaningful patterns rather than being biased toward the majority class.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>datascience</category>
      <category>data</category>
    </item>
    <item>
      <title>it's not dumb, but i guess it can be helpful</title>
      <dc:creator>Shruti Nakum</dc:creator>
      <pubDate>Fri, 31 Oct 2025 05:25:58 +0000</pubDate>
      <link>https://dev.to/shruti_nakum/-1318</link>
      <guid>https://dev.to/shruti_nakum/-1318</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/shruti_nakum" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3564334%2F7dea189a-f32d-4f09-b044-fab0fca5b7e6.png" alt="shruti_nakum"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/shruti_nakum/how-do-you-ensure-data-integrity-and-quality-in-your-data-pipelines-511g" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How do you ensure data integrity and quality in your data pipelines?&lt;/h2&gt;
      &lt;h3&gt;Shruti Nakum ・ Oct 29&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#discuss&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#dataengineering&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#learning&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>discuss</category>
      <category>dataengineering</category>
      <category>learning</category>
    </item>
    <item>
      <title>How Do You Handle Data Mapping and Field Mapping During Migration?</title>
      <dc:creator>Shruti Nakum</dc:creator>
      <pubDate>Thu, 30 Oct 2025 13:57:45 +0000</pubDate>
      <link>https://dev.to/shruti_nakum/how-do-you-handle-data-mapping-and-field-mapping-during-migration-ep3</link>
      <guid>https://dev.to/shruti_nakum/how-do-you-handle-data-mapping-and-field-mapping-during-migration-ep3</guid>
      <description>&lt;p&gt;When handling data mapping and field mapping during migration, I start by gaining a clear understanding of both the source and target systems. I analyze their data structures, field names, data types, and relationships to identify how each field in the source maps to the corresponding field in the destination.&lt;/p&gt;

&lt;p&gt;I usually create a data mapping document that outlines every field, its transformation rules, and any changes needed, such as format conversions, default values, or data type adjustments. This document acts as a blueprint for the migration process and helps maintain consistency across the team.&lt;/p&gt;
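
&lt;p&gt;Such a mapping document can be reduced to code, one entry per target field with its source field and transformation rule; every field name and rule below is a hypothetical example:&lt;/p&gt;

```python
# A data mapping document as code: each target field records its source
# field, a transformation rule, and an optional default value.
FIELD_MAP = {
    "customer_name": {"source": "cust_nm", "transform": str.strip},
    "signup_date": {"source": "created", "transform": lambda v: v[:10]},  # keep YYYY-MM-DD
    "country": {"source": "country_cd", "transform": str.upper, "default": "US"},
}

def map_record(source_row):
    target = {}
    for field, rule in FIELD_MAP.items():
        raw = source_row.get(rule["source"], rule.get("default"))
        target[field] = rule["transform"](raw) if raw is not None else None
    return target

legacy = {"cust_nm": "  Ada Lovelace ", "created": "2024-03-05T09:30:00"}
print(map_record(legacy))
```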

&lt;p&gt;If there are fields that don’t have a direct match, I work with stakeholders to decide how to handle them, sometimes through data transformation, sometimes by creating new fields in the target system.&lt;/p&gt;

&lt;p&gt;Before final migration, I perform test runs to validate that the mapped data loads correctly and maintains integrity. Automation tools and scripts can help with this, but I always include manual validation to catch anything the tools might miss.&lt;/p&gt;

&lt;p&gt;As a seasoned developer, I make sure the &lt;a href="https://www.bacancytechnology.com/data-migration-services" rel="noopener noreferrer"&gt;data migration services&lt;/a&gt; I offer rely heavily on precise field mapping and validation, ensuring that every piece of data ends up in the right place with the correct format and meaning.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>data</category>
    </item>
  </channel>
</rss>
