<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohamed Fayaz</title>
    <description>The latest articles on DEV Community by Mohamed Fayaz (@mohamedfayaz).</description>
    <link>https://dev.to/mohamedfayaz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F403686%2F504b2790-413e-4685-984a-f59318b9898c.png</url>
      <title>DEV Community: Mohamed Fayaz</title>
      <link>https://dev.to/mohamedfayaz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mohamedfayaz"/>
    <language>en</language>
    <item>
      <title>AWS CodeCommit and GitKraken Basics: The Essential Skills for Every Developer</title>
      <dc:creator>Mohamed Fayaz</dc:creator>
      <pubDate>Thu, 06 Apr 2023 07:13:10 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-codecommit-and-gitkraken-basics-the-essential-skills-for-every-developer-4ki1</link>
      <guid>https://dev.to/aws-builders/aws-codecommit-and-gitkraken-basics-the-essential-skills-for-every-developer-4ki1</guid>
      <description>&lt;p&gt;Git is a source code management system that keeps track of the changes made to their codebase and collaborates with other team members. It is commonly used for source code management in software development, but it can track changes to any set of files.&lt;/p&gt;

&lt;p&gt;In a version control system, every time a file is modified, a new version is created and saved. This allows users to go back to previous versions of a file, and it also lets multiple engineers collaborate simultaneously without overwriting each other's changes.&lt;/p&gt;

&lt;p&gt;Git keeps track of changes to a repository, a central location where files are stored. When users want to change a file, they edit their local copy, make their changes, and then commit them, which saves the modified version of the file to the repository. Other users can then pull the updated version of the file from the repository and merge it into their own copies.&lt;/p&gt;

&lt;p&gt;Git also includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tools for comparing different versions of a file.&lt;/li&gt;
&lt;li&gt;Support for resolving conflicts when multiple people have changed the same file.&lt;/li&gt;
&lt;li&gt;A complete history of the changes made to a file over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a powerful tool widely used in software development and other fields where version control is essential.&lt;/p&gt;

&lt;h2&gt;
  
  
  👷🏼‍♀️️Git Workflow
&lt;/h2&gt;

&lt;p&gt;The Git workflow is the series of steps people follow when using Git to manage a project. Here is a general outline of the Git workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create a repository: A repository is a central location where all the files for a project are stored. To create a repository, you can use the &lt;strong&gt;&lt;code&gt;git init&lt;/code&gt;&lt;/strong&gt; command to initialize an empty repository on your local machine or create a repository on a remote server and clone it to your local machine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a branch: Creating a branch takes a snapshot of your project at that point in time and diverges from the main development branch. You can then change your codebase on the new branch without affecting the "main" branch. This allows you to experiment and work on new features without worrying about breaking the main codebase. When you are ready, you can merge your changes back into the "main" branch to incorporate them into the codebase. You can create a new branch using the &lt;strong&gt;&lt;code&gt;git branch&lt;/code&gt;&lt;/strong&gt; command and switch to a different branch using the &lt;strong&gt;&lt;code&gt;git checkout&lt;/code&gt;&lt;/strong&gt; command.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make changes: Once you have a repository set up, you can start changing the files in the repository. When you change a file, Git will recognize that the file has been modified.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stage changes: Before you commit your changes, you need to stage them. This involves adding the modified files to a staging area, a list of changes that will be included in the next commit. You can stage changes by using the &lt;strong&gt;&lt;code&gt;git add&lt;/code&gt;&lt;/strong&gt; command.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Commit changes: When you are ready to save your changes, you can commit them to the repository. A commit is a snapshot of the repository at a particular point in time, and it includes all the staged changes. You can commit your changes by using the &lt;strong&gt;&lt;code&gt;git commit&lt;/code&gt;&lt;/strong&gt; command.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Push changes: If you are working with a repository on a remote server, you will need to push your changes to the server for them to be shared with others. You can do this by using the &lt;strong&gt;&lt;code&gt;git push&lt;/code&gt;&lt;/strong&gt; command.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pull changes: If other people have made changes to the repository and pushed them to the server, you can pull those changes to your local machine by using the &lt;strong&gt;&lt;code&gt;git pull&lt;/code&gt;&lt;/strong&gt; command. This will merge the changes into your local repository.&lt;br&gt;
There are many other Git commands available, and you can learn more about them in the Git documentation or by using the &lt;strong&gt;&lt;code&gt;git help&lt;/code&gt;&lt;/strong&gt; command.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
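&lt;p&gt;As a quick sketch, the workflow above looks like this on the command line (the directory, file, and branch names are just placeholders for the demo):&lt;/p&gt;

```shell
set -e
cd "$(mktemp -d)"                       # scratch directory for the demo
git init -q .                           # 1. create a repository
git config user.email demo@example.com  # placeholder identity for the demo
git config user.name "Demo User"
echo "hello" > README.md                # 3. make changes
git add README.md                       # 4. stage changes
git commit -q -m "Initial commit"       # 5. commit changes
git branch feature/demo                 # 2. create a branch (needs one commit first)...
git checkout -q feature/demo            # ...and switch to it
# 6./7. with a remote configured: git push / git pull
```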

&lt;h2&gt;
  
  
  ♛ Branching Strategy
&lt;/h2&gt;

&lt;p&gt;Git branching allows developers to work on multiple features or bug fixes simultaneously within a single Git repository. By using branches, developers can isolate their work from the main codebase and merge their changes back in when they are ready. There are several strategies for using branches in Git, including:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;"Gitflow"&lt;/strong&gt; strategy, involves creating long-lived branches for development and releases, as well as short-lived branches for hotfixes and features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;"Feature Branch"&lt;/strong&gt; strategy, involves creating a new branch for each feature or bug fix and merging the branches back into the main codebase when the work is complete.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;"Trunk-Based Development"&lt;/strong&gt; strategy, involves working directly on the main codebase (also known as the "trunk") and using short-lived branches for quick bug fixes or experiments.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
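&lt;p&gt;For example, the feature-branch strategy might play out like this (repo, file, and branch names are invented for the demo):&lt;/p&gt;

```shell
set -e
cd "$(mktemp -d)"
git init -q .
git config user.email demo@example.com  # placeholder identity for the demo
git config user.name "Demo User"
echo "v1" > app.txt
git add app.txt
git commit -q -m "base"
git checkout -q -b feature/login        # isolate the new feature on its own branch
echo "login form" >> app.txt
git add app.txt
git commit -q -m "add login"
git checkout -q -                       # back to the main development branch
git merge -q --no-edit feature/login    # fold the finished feature back in
git branch -q -d feature/login          # delete the short-lived branch
```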

&lt;p&gt;The best branching strategy will depend on the specific needs and workflow of your development team.&lt;/p&gt;

&lt;p&gt;If you prefer a graphical interface over the command line, &lt;strong&gt;GitKraken&lt;/strong&gt; is a free and useful tool. Its user-friendly interface lets you perform all the common Git operations visually, which can save time for beginners.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚒️ GitKraken
&lt;/h2&gt;

&lt;p&gt;GitKraken is a popular Git client for developers that offers a user-friendly interface and a variety of features to make working with Git easier. It is available for Windows, Mac, and Linux and can be used with a wide range of Git hosting services, including GitHub, GitLab, and Bitbucket.&lt;/p&gt;

&lt;p&gt;One of the key features of GitKraken is its intuitive graphical user interface, which makes it easy to visualize and manage Git repositories. The interface includes a visual representation of the commit history, called the &lt;strong&gt;&lt;code&gt;commit graph&lt;/code&gt;&lt;/strong&gt;, which lets users easily see the changes made to the repository over time. It also includes a "Git flow" feature, which helps users follow best practices when working with Git branches.&lt;/p&gt;

&lt;p&gt;In addition to its graphical interface, GitKraken also offers a range of advanced features for developers. It includes tools for code review, such as leaving inline comments on code changes. It supports a range of collaboration features, including the ability to assign tasks to team members. It also includes integration with popular tools like JIRA and Slack, making it easy to connect your Git workflow with other tools you use.&lt;/p&gt;

&lt;p&gt;Managing work in progress, pull requests, and branches for issues across multiple repositories can be challenging. That's why the release of &lt;a href="https://help.gitkraken.com/gitkraken-client/current/#version-9-0-0" rel="noopener noreferrer"&gt;GitKraken Client 9.0&lt;/a&gt; is so exciting. This major version release is expected to significantly improve the development workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's demonstrate how to use AWS CodeCommit as a version control system and GitKraken as the graphical user interface to manage your Git workflow
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 1:&lt;/strong&gt; Install GitKraken: Download and install GitKraken on your local machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 2:&lt;/strong&gt; Set up an AWS account: If you don't already have an AWS account, you'll need to create one to use AWS CodeCommit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 3:&lt;/strong&gt; Connect GitKraken to your AWS CodeCommit Repository: Before we connect, we need to &lt;a href="https://docs.aws.amazon.com/codecommit/latest/userguide/setting-up-gc.html?icmpid=docs_acc_console_connect_np" rel="noopener noreferrer"&gt;generate credentials to clone the CodeCommit repo over HTTPS&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Go to AWS IAM &amp;gt; Users &amp;gt; Security Credentials &amp;gt; Generate Credentials.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsefxv3ih5n3qrhqcmsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsefxv3ih5n3qrhqcmsq.png" alt="Generate CodeCommit Credentials" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Step 5:&lt;/strong&gt; Create a CodeCommit repository: Log in to the AWS Management Console and navigate to the CodeCommit dashboard. Click the "Create repository" button and follow the prompts to create a new repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Step 6:&lt;/strong&gt; Clone the repository: In GitKraken, go to File &amp;gt; Clone Repo and enter the URL of your CodeCommit repository. This will create a local copy of the repository on your machine.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0mn091m9vax8eaieqt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0mn091m9vax8eaieqt7.png" alt="Clone URL" width="800" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Step 7:&lt;/strong&gt; Enter the username and password generated in step 4.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Step 8:&lt;/strong&gt; Make changes and commit: Make any desired changes to the files in your local repository and use GitKraken to stage and commit the changes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
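&lt;p&gt;If you later want to drive the same repository from the command line instead of GitKraken, the AWS CodeCommit documentation also describes a credential-helper setup as an alternative to the static Git credentials generated earlier. A sketch, assuming the AWS CLI is installed and configured (shown here in a scratch repo):&lt;/p&gt;

```shell
cd "$(mktemp -d)"
git init -q .
# Route Git's HTTPS authentication through the AWS CLI credential helper
git config credential.helper '!aws codecommit credential-helper $@'
git config credential.UseHttpPath true
# Then clone over HTTPS (region and repo name below are placeholders):
# git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/my-repo
```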

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2ak1m9v74o7q912wfyf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2ak1m9v74o7q912wfyf.png" alt="Commit and Push" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 9:&lt;/strong&gt; Push changes to CodeCommit: Use GitKraken to push your local commits to the CodeCommit repository, which will make them available to other users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tk1s667bf0qjmvse4s0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tk1s667bf0qjmvse4s0.gif" alt="Meme" width="640" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A tool like GitKraken is very helpful if you are switching between various repos across different accounts. Also, while a GUI like GitKraken can be useful for users who are new to Git or who prefer a graphical interface, many experienced Git users prefer the command-line interface (CLI) because it offers more control and flexibility.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;✍️ About the Author&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;Stay ahead of the curve with &lt;strong&gt;Mohamed Fayaz&lt;/strong&gt;, a visionary Data &amp;amp; AI consultant and insightful technical blogger. He provides expert commentary on vital topics such as Software Engineering, Big Data Analytics, Cybersecurity, and Cloud Engineering. Follow him on &lt;a href="https://www.linkedin.com/in/mohamedfayazz/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://twitter.com/0xfayaz" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; for the latest updates.&lt;/p&gt;




</description>
      <category>gratitude</category>
    </item>
    <item>
      <title>Privacy-Preserving AI: What It Is and Why It Matters?</title>
      <dc:creator>Mohamed Fayaz</dc:creator>
      <pubDate>Sat, 25 Mar 2023 09:18:36 +0000</pubDate>
      <link>https://dev.to/mohamedfayaz/privacy-preserving-ai-what-it-is-and-why-it-matters-e43</link>
      <guid>https://dev.to/mohamedfayaz/privacy-preserving-ai-what-it-is-and-why-it-matters-e43</guid>
      <description>&lt;p&gt;Let’s start with an example; suppose your employer uses an AI system to analyze employee data and make decisions about hiring and promotion. In that case, there’s a chance it could use your race or gender as one of its criteria features for making those decisions. If this happens without your knowledge or consent, especially if you don’t agree with how they’re being used, then there could be legal implications for both the company and the employees. This problem has been at least partially addressed by restricting access to certain types of sensitive information like faces and gender while still allowing access through other channels such as text search terms or GPS coordinates. However, these solutions still don’t fully solve all privacy concerns since there will always be ways around them.&lt;/p&gt;

&lt;p&gt;This problem has been at least partially addressed by restricting access to certain types of sensitive information, such as faces and gender, while still allowing access through other channels such as text search terms or GPS coordinates. These solutions still don’t fully resolve the privacy concerns, however, since there will always be ways around them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;Now imagine if Facebook had been able to use facial recognition software on all the photos uploaded over the years; it could easily have built an extensive database containing thousands upon thousands of faces without ever asking permission from any users. Likewise, a government agency or corporation could use an AI system like this at airports or checkpoints without asking permission from anyone who passes through them. Consider, too, how many times we’ve seen companies like Facebook ask us for access so that “our friends” can see what we’re doing (and vice versa). This is particularly concerning because many applications let users share their data with friends and family members. In these cases, sharing personal information may be voluntary, but outsiders can still gain access, because the default privacy settings in most apps make sensitive data publicly available unless users take affirmative action to change them (such as turning off location services).&lt;/p&gt;

&lt;h2&gt;
  
  
  What Can Be Done?
&lt;/h2&gt;

&lt;p&gt;AI solutions should be designed so that they can learn without being able to identify individuals by name or other direct identifiers. This is important because it keeps the data anonymous and private while still allowing the machine-learning algorithm to make accurate predictions. An example of this kind of AI system is one that predicts whether someone will be diagnosed with a specific disease in the future based on their genome sequence data. In this case, before making any predictions about an individual’s likelihood of developing cancer or Alzheimer’s disease based on their genetic makeup, all personal information should be removed from the dataset (e.g., names, religion, nationality, and so on). This allows researchers to study trends across large populations without compromising anyone’s privacy.&lt;/p&gt;

&lt;p&gt;These types of systems are called “de-identified,” meaning any directly identifying information about the individuals in the dataset is removed before it is used. It is a privacy protection measure that allows researchers and others with access to de-identified datasets to analyze the information without compromising the privacy of those represented within it. Note that de-identifying is not the same as anonymizing data: a de-identified dataset may still contain indirect identifiers like zip codes, birthdates, or phone numbers, which could be used to identify individuals if linked together with other sources (such as census records). For example, your 10-digit mobile number may be replaced with a different set of digits, but the replacement would still be the same everywhere that number appears in the dataset, so records remain linkable.&lt;/p&gt;
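&lt;p&gt;A minimal sketch of the distinction in code (the record fields, identifier list, and 10-character token length are arbitrary choices for the demo):&lt;/p&gt;

```python
import hashlib

record = {"name": "Alice", "phone": "0412345678", "zip": "90210", "diagnosis": "flu"}

# De-identification: strip the direct identifiers before analysis.
DIRECT_IDENTIFIERS = {"name", "phone"}
deidentified = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

# Pseudonymization: replace an identifier with a consistent token.
# The same phone number always yields the same token, so records can
# still be linked across the dataset (and re-identified with side data).
def pseudonymize(value):
    return hashlib.sha256(value.encode()).hexdigest()[:10]

print(deidentified)                    # direct identifiers are gone
print(pseudonymize(record["phone"]))   # a stable token, not true anonymity
```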

&lt;h2&gt;
  
  
  Approaches for Making AI More Private and Secure
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Differential Privacy
&lt;/h3&gt;

&lt;p&gt;Differential privacy is a widely used technique for privacy-preserving ML. Its goal is to protect the privacy of individual data points in a dataset by adding carefully calibrated noise to the data, making it difficult to identify any individual’s contribution while leaving the ML model’s output largely unaffected. The amount of noise added is determined by a parameter called epsilon. The lower the epsilon value, the more noise is added and the greater the privacy protection. However, as the amount of noise increases, the accuracy of the ML model decreases. Therefore, finding an optimal value for epsilon is crucial to achieving a balance between privacy and accuracy.&lt;/p&gt;
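&lt;p&gt;A minimal sketch of the Laplace mechanism for a counting query (the true count of 42 and epsilon of 0.5 are arbitrary demo values; a counting query has sensitivity 1, so the noise scale is 1/epsilon):&lt;/p&gt;

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy (sensitivity 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
releases = [private_count(42, epsilon=0.5, rng=rng) for _ in range(10_000)]
# Each release is noisy, but the noise has zero mean, so many releases
# average out close to the true count of 42.
print(sum(releases) / len(releases))
```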

&lt;h3&gt;
  
  
  Federated Learning
&lt;/h3&gt;

&lt;p&gt;Federated learning trains a shared model across many devices or organizations: each participant trains on its own data locally and sends only model updates, never the raw data, to a central aggregator. It has been used in various applications, including natural language processing, image classification, and recommendation systems. Federated learning is useful where data privacy is crucial, and also where data is not easily transferable to a central location. For example, it can be used for training ML models on mobile devices, where users may not want to share their data with a central server.&lt;/p&gt;
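&lt;p&gt;In the most common setup (FedAvg), a server averages the clients' locally trained model weights, weighted by the size of each client's local dataset. A toy sketch of that aggregation step, with two-parameter models and made-up client sizes:&lt;/p&gt;

```python
def federated_average(client_weights, client_sizes):
    """FedAvg aggregation: average client model weights, weighted by
    how many local examples each client trained on."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Three clients train locally and share only their weight vectors;
# the raw training data never leaves the device.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [10, 20, 70]
print(federated_average(clients, sizes))  # [4.2, 5.2]
```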

&lt;h3&gt;
  
  
  Homomorphic Encryption
&lt;/h3&gt;

&lt;p&gt;Homomorphic encryption allows a system to perform operations on encrypted data without decrypting it. This technique can be used to train ML models on encrypted data, ensuring that the data remains private, and it can be applied to various ML models, including linear regression, decision trees, and neural networks. It is particularly useful when working with sensitive data that cannot be shared, such as medical records or financial data. However, homomorphic encryption is computationally expensive, which makes it less practical for some applications.&lt;/p&gt;
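&lt;p&gt;A toy illustration of the idea, using a textbook Paillier scheme in which the product of two ciphertexts decrypts to the sum of the plaintexts. The primes here are deliberately tiny and insecure; a real deployment would use a vetted library and large keys:&lt;/p&gt;

```python
import math

# Textbook Paillier with toy parameters -- additively homomorphic.
p, q = 17, 19
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)
g = n + 1
mu = pow(lam, -1, n)                  # valid because g = n + 1

def encrypt(m, r):
    # r must be coprime to n; a real system picks it at random
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    L = (pow(c, lam, n2) - 1) // n
    return (L * mu) % n

c1, c2 = encrypt(5, r=2), encrypt(7, r=3)
combined = (c1 * c2) % n2             # multiply ciphertexts...
print(decrypt(combined))              # ...and the result decrypts to 5 + 7 = 12
```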

&lt;h3&gt;
  
  
  Libraries to Build Privacy-Preserving Models
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/tensorflow/privacy"&gt;Differential privacy in Tensorflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/OpenMined/PySyft"&gt;Federated Learning in PyTorch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mpc-msri/EzPC"&gt;MPC in Tensorflow&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Privacy is becoming an increasingly important concern in the field of AI. Fortunately, there are several techniques available for developing privacy-preserving models. Differential privacy, federated learning, and homomorphic encryption are just a few examples of these techniques. With the right approach, we can continue to make progress in the field of machine learning while also addressing privacy concerns.&lt;/p&gt;

&lt;p&gt;I hope this article gives you a better understanding of why it is essential to have privacy-preserving systems and how best we can protect individual privacy while developing a more robust and advanced Artificial Intelligence system.&lt;/p&gt;




&lt;p&gt;✍️ Want to Connect?&lt;/p&gt;

&lt;p&gt;Mohamed Fayaz is a Data &amp;amp; AI Consultant and a technical blogger who writes and speaks about the topics such as Software Engineering, Big Data Analytics, and Cloud Engineering. Connect with him on LinkedIn or follow him on Twitter for updates.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ethicalai</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why Data Competency Is Critical for Cyber Intelligence</title>
      <dc:creator>Mohamed Fayaz</dc:creator>
      <pubDate>Mon, 20 Feb 2023 01:45:09 +0000</pubDate>
      <link>https://dev.to/mohamedfayaz/why-data-competency-is-critical-for-cyber-intelligence-1jch</link>
      <guid>https://dev.to/mohamedfayaz/why-data-competency-is-critical-for-cyber-intelligence-1jch</guid>
      <description>&lt;p&gt;It's no secret that data engineering skills are a must for cyber intelligence pros. But what exactly is data engineering? It's the ability to understand the meaning behind complex algorithms, and the discipline of applying those algorithms to make sense of large amounts of data. Data engineers are also adept at building systems that process and store information in ways that allow people to understand it—and even use it to predict future events.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Hackers to Cyber Intelligence
&lt;/h2&gt;

&lt;p&gt;Data engineering is a skill that's used to collect and analyze data. It can be used to understand the quality of the data, as well as make predictions about it. This makes it an important skill for cyber intelligence professionals because they must ensure that their systems collect good quality information to make accurate predictions.&lt;/p&gt;

&lt;p&gt;Data science is a large field with many different sub-specialties within it—and data engineering is one such specialty. Its focus on understanding how systems work means that these professionals have an affinity for building things from scratch, writing code in languages like Python or R rather than just using pre-existing software packages available from third-party vendors (such as Google Analytics).&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Engineering for Understanding
&lt;/h2&gt;

&lt;p&gt;To begin any analysis, it's crucial to comprehend the initial data. Data engineers play a vital role in extracting valuable insights from raw data and converting it into informative data. This process includes discovering patterns, trends, and connections within data sets that aid cyber intelligence experts in making educated decisions about their field of focus. &lt;/p&gt;

&lt;p&gt;Data engineering skills are also needed when working with big datasets (such as social media or web usage data). These types of data sets have many variables that need to be analyzed individually; however, those variables often behave similarly across different subgroups. To understand what these properties mean and how they relate to one another, you need someone who understands how each variable behaves individually—and can then aggregate the results onto larger scales (e.g., country-wide).&lt;/p&gt;

&lt;h2&gt;
  
  
  A Pipeline for Cyber Intelligence
&lt;/h2&gt;

&lt;p&gt;Data engineering is a necessary skill for cyber intelligence. It's the process of building a pipeline to collect data and then analyze it. A pipeline consists of data sources, a data store, and an analysis tool. This can be done in several steps:&lt;/p&gt;

&lt;p&gt;First, you'll need to define your queries or use pre-built tools that help you find specific information from your source (e.g., search history). Then draw connections between different pieces of information using indexes or keywords; in databases these connections are called joins, because they join multiple rows into one result set. You may also want to move this data so that it can be used elsewhere without direct access to its original source—this is called ETL (extract, transform, load) or ELT (extract, load, transform). Finally, visualize graphs representing the results and any additional analyses performed on them; this layer is called BI (business intelligence).&lt;/p&gt;
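&lt;p&gt;The pipeline can be sketched end to end in a few lines (the log lines and the "failed login" rule below are invented for the demo):&lt;/p&gt;

```python
raw_events = [
    "2023-02-01 10:00:01 login failed user=root ip=10.0.0.5",
    "2023-02-01 10:00:03 login ok user=alice ip=10.0.0.9",
    "2023-02-01 10:00:04 login failed user=root ip=10.0.0.5",
]

def extract(lines):
    # E: pull raw records out of the source as (date, time, message) rows
    return [line.split(" ", 2) for line in lines]

def transform(rows):
    # T: keep only the suspicious events (a filter/join stage in practice)
    return [row for row in rows if "failed" in row[2]]

def load(rows):
    # L: hand the cleaned rows off to the analysis store / BI layer
    return {"suspicious_events": len(rows)}

print(load(transform(extract(raw_events))))  # {'suspicious_events': 2}
```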

&lt;h2&gt;
  
  
  The Future of Cyber Threats
&lt;/h2&gt;

&lt;p&gt;The future of cyber threats is a digital arms race. It will be a battle for data, for the cloud and for AI.&lt;/p&gt;

&lt;p&gt;In this digital warfare, intelligence agencies have been forced to act quickly in order to stay ahead of their adversaries' capabilities. They need to understand how these technologies work so that they can intercept communications or take over control systems before it's too late—and they need more data scientists who are trained in machine learning techniques like deep learning (DL), which uses neural networks as part of its predictive analytics process. This means that even if you're not a coder yourself (or don't know what one looks like), you could still benefit from taking classes on DL theory at your local community college or university program.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data engineering skills are vital for cyber intelligence pros.
&lt;/h2&gt;

&lt;p&gt;Data engineering skills are vital for cyber intelligence pros. They're needed to analyze data and make it more useful, as well as to create insights from large sets of information.&lt;/p&gt;

&lt;p&gt;Data engineers work on projects that require them to pull together large sets of structured or unstructured data into a cohesive whole. They often use the tools available in their field: SQL databases, text analysis software like Natural Language Toolkit (NLTK), statistical packages with R and Python, but also visualizations such as Tableau or D3js.&lt;/p&gt;

&lt;p&gt;Data engineering is a complex skill set that can be mastered by anyone with the right mindset and training. It’s an essential skill for cyber intelligence pros, who need to understand how the systems they work with operate, as well as how they are used. Data engineers have a crucial role to play in protecting organizations from cyber threats and ensuring that their data is safe from hackers and other malicious actors. With all of these skills under your belt, you’ll be able to make smart decisions when it comes time for action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, implementing standards such as the Open Cybersecurity Schema Framework (OCSF) enables data engineers to standardize data collection and mapping, leading to a streamlined process for security teams. This results in a common language for data analysis and threat detection, facilitating collaboration between data scientists, analysts, and security teams. The ultimate aim of any such framework is to serve as a widely accepted open standard that complements existing security practices and processes.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;✍️ About the Author&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mohamed Fayaz&lt;/strong&gt; is a Data &amp;amp; AI Consultant, and a technical blogger who writes and speaks about the topics such as &lt;strong&gt;Software Engineering&lt;/strong&gt;, &lt;strong&gt;Big Data Analytics&lt;/strong&gt;, and &lt;strong&gt;Cloud Engineering&lt;/strong&gt;. Connect with him on &lt;a href="https://www.linkedin.com/in/mohamedfayazz/"&gt;LinkedIn&lt;/a&gt; or follow him on &lt;a href="https://twitter.com/0xfayaz"&gt;Twitter&lt;/a&gt; for updates.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>cybersecurity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Introduction to AWS AppSync - Fully managed GraphQL Service</title>
      <dc:creator>Mohamed Fayaz</dc:creator>
      <pubDate>Thu, 02 Feb 2023 02:44:54 +0000</pubDate>
      <link>https://dev.to/aws-builders/introduction-to-aws-appsync-fully-managed-graphql-service-4mff</link>
      <guid>https://dev.to/aws-builders/introduction-to-aws-appsync-fully-managed-graphql-service-4mff</guid>
      <description>&lt;p&gt;Build, deploy, and manage mobile and web apps that need real-time or offline data are simple with AWS AppSync, a fully managed serverless GraphQL service. Your apps may securely access and work with data stored in AWS services like Amazon DynamoDB, Amazon Elasticsearch Service, and AWS Lambda by building GraphQL APIs using AppSync.&lt;/p&gt;

&lt;p&gt;One of the key benefits of using AppSync is that it allows you to build scalable, responsive applications without the need to manage complex server infrastructure. AppSync handles all of the underlying network and security protocols for you, allowing you to focus on building great user experiences.&lt;/p&gt;

&lt;p&gt;In addition to providing real-time and offline data access, AppSync also offers a number of other features that make it a powerful tool for building modern applications. These features include:&lt;/p&gt;

&lt;p&gt;⦿ GraphQL Transform: This is a tool that helps you build GraphQL APIs quickly and easily by providing a set of pre-built, customizable GraphQL resolvers.&lt;/p&gt;

&lt;p&gt;⦿ Subscriptions: AppSync lets you create subscriptions through which your clients receive updates in real time when certain events occur, such as when data is updated or deleted.&lt;/p&gt;

&lt;p&gt;⦿ Data manipulation: AppSync provides a number of powerful data manipulation capabilities, including the ability to create, update, and delete data, as well as the ability to perform complex queries and mutations on data.&lt;/p&gt;

&lt;p&gt;⦿ Security: AppSync integrates with AWS Identity and Access Management (IAM), Amazon Cognito, and API keys to provide fine-grained access controls and protect your data from unauthorized access.&lt;/p&gt;

&lt;p&gt;Some other key security features of AWS AppSync include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data encryption: All data is encrypted in transit and at rest using industry-standard encryption algorithms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identity and access management (IAM): You can use IAM to control access to your AWS AppSync resources and data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VPC Endpoints: You can use VPC Endpoints to securely access your AppSync APIs from within your Amazon VPC, without exposing your APIs to the public internet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Resource-level permissions: You can use resource-level permissions to control access to specific GraphQL operations and fields.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon Cognito authentication: You can use Amazon Cognito to authenticate users and authorize access to your AppSync APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OAuth 2.0 support: You can use OAuth 2.0 to authenticate users and authorize access to your AppSync APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identity federation: You can use identity federation to authenticate users with third-party identity providers, such as Google or Facebook.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1️⃣️ Schema
&lt;/h2&gt;

&lt;p&gt;In AppSync, the schema defines the shape of your data and the operations that can be performed on it. The schema is written in GraphQL, which is a language for querying and mutating data. The schema consists of types, fields, and relationships between those types.&lt;/p&gt;
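&lt;p&gt;For example, a schema might define two related types and the fields clients can read (the names below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight graphql"&gt;&lt;code&gt;type Author {
    id: ID!
    name: String
}

type Post {
    id: ID!
    title: String
    author: Author   # a relationship between two types
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;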

&lt;h2&gt;
  
  
  2️⃣ Resolvers
&lt;/h2&gt;

&lt;p&gt;Resolvers are functions that resolve a GraphQL query to a specific data source. When a client issues a query to the GraphQL API, the query is forwarded to the appropriate resolver for handling. The resolver retrieves the requested data from the data source and returns it to the client.&lt;/p&gt;
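&lt;p&gt;For instance, a DynamoDB resolver is configured with a request mapping template. The sketch below, for a hypothetical query that takes an &lt;code&gt;id&lt;/code&gt; argument, asks DynamoDB for the item whose key matches that argument:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
    "version": "2017-02-28",
    "operation": "GetItem",
    "key": {
        "id": $util.dynamodb.toDynamoDBJson($ctx.args.id)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;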

&lt;h2&gt;
  
  
  3️⃣️ Mutations
&lt;/h2&gt;

&lt;p&gt;AppSync also allows you to define mutations, which are operations that modify data. Mutations work similarly to queries, but they require a resolver to handle the data modification and return the updated data to the client.&lt;/p&gt;

&lt;p&gt;The schema defines the structure of the data and the operations that can be performed on it, while the resolvers handle the actual retrieval and modification of data from the data sources.&lt;/p&gt;

&lt;p&gt;For example, consider a schema that has a type called "Post" and fields called "title" and "content". To retrieve the data for a specific Post, you might define a resolver that queries a DynamoDB table for the Post with a specific ID. The resolver would return the Post's title and content to the client as a response to the query. Mutations can then be used to update the title or content.&lt;/p&gt;
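&lt;p&gt;A mutation for that Post example might look like the following; the exact argument shape depends on your generated schema, so treat this as a sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight graphql"&gt;&lt;code&gt;mutation UpdatePost {
    # Update the title of Post "123" and return the modified fields
    updatePost(input: { id: "123", title: "A new title" }) {
        id
        title
        content
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;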

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjljgmzn6c2d8zmvgyv6k.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjljgmzn6c2d8zmvgyv6k.gif" alt="AWS Meme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🎥 Demo
&lt;/h2&gt;

&lt;p&gt;First, let’s create a GraphQL endpoint to retrieve the title and content from a Post table in DynamoDB. &lt;/p&gt;

&lt;p&gt;Step 1: Log in to the AWS console and go to AppSync&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fln1wgj5br5dglw6n9cks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fln1wgj5br5dglw6n9cks.png" alt="AWS AppSync Console"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 2: Choose Create with wizard and create a model. The &lt;code&gt;model&lt;/code&gt; should contain the list of fields you want to store in and retrieve from the DynamoDB table. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd61xdyp3vijiylzoc9vp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd61xdyp3vijiylzoc9vp.png" alt="AppSync Creation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 3: Click &lt;strong&gt;Create&lt;/strong&gt;, and on the next screen name your API "&lt;strong&gt;My AppSync App API&lt;/strong&gt;". Click &lt;strong&gt;Create&lt;/strong&gt;. &lt;br&gt;
The AppSync console will deploy your DynamoDB table and create your AppSync schema. The schema includes autogenerated queries, mutations, and subscriptions. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6xuxxzxoh3n1hv52fou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6xuxxzxoh3n1hv52fou.png" alt="create AppSync"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 4: On the left, you can see a list of tabs: &lt;strong&gt;Schema, Data Sources, Functions, Queries, Caching, Settings and Monitoring.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabxa44qaam3kkm24motr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabxa44qaam3kkm24motr.jpeg" alt="AppSync Tabs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schema&lt;/strong&gt;: The schema defines the types, fields, and operations (queries, mutations, and subscriptions) that clients can execute on the API. The schema is written in the GraphQL schema definition language (SDL) and is used to validate client requests and to drive client-side code generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Sources&lt;/strong&gt;: AppSync allows you to connect to various data sources, such as DynamoDB tables, Lambda functions, and Elasticsearch domains, to retrieve and manipulate data in your API. You can also use AppSync's built-in data sources for authentication and authorization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Functions&lt;/strong&gt;: AppSync allows you to write custom Lambda functions to perform additional logic or data manipulation before or after data is retrieved from a data source. These functions can be triggered by specific events, such as a client query or mutation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Queries&lt;/strong&gt;: Clients can use the GraphQL query operation to retrieve data from the API. AppSync automatically maps the client query to the appropriate data source and resolves any fields in the query that are defined in the schema.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Caching&lt;/strong&gt;: AppSync allows you to enable caching for specific queries to improve the performance of your API. Cached data is stored in an in-memory cache and is automatically invalidated when the data in the data source is updated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Settings and Monitoring&lt;/strong&gt;: AppSync provides a variety of settings and monitoring options that allow you to customize the behavior of your API and monitor its usage. You can set up logging and tracing for debugging and troubleshooting, configure caching and connection pooling, and use CloudWatch for monitoring metrics and log streams.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s open the Schema tab and review the schema, resolvers, and query that let us fetch the title and content from DynamoDB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;MyModelType&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;id:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ID!&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;title:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;String&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;content:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;String&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;getMyModelType(id:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ID!):&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;MyModelType&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the right, you can see that the resolvers are attached to the DynamoDB table. &lt;/p&gt;

&lt;p&gt;Now, to query the table by id, go to the &lt;code&gt;Queries&lt;/code&gt; tab and run the snippet below to fetch the information&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;DemoQuery&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;getMyModelType(id:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123"&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;id&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;content&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;title&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ujvfjg5fpv48m9sxuzs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ujvfjg5fpv48m9sxuzs.png" alt="AppSync Query"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The endpoint uses an API key to authenticate requests. With this API key and the URL, we can also use Postman to run the query. &lt;/p&gt;
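&lt;p&gt;Under the hood, Postman simply sends an HTTP POST with an &lt;code&gt;x-api-key&lt;/code&gt; header and the GraphQL query wrapped in a JSON body. Roughly (the endpoint and key below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight"&gt;&lt;code&gt;POST https://YOUR-API-ID.appsync-api.YOUR-REGION.amazonaws.com/graphql
Content-Type: application/json
x-api-key: YOUR-API-KEY

{ "query": "query DemoQuery { getMyModelType(id: \"123\") { id content title } }" }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;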

&lt;h3&gt;
  
  
  Go to the Settings tab to get the credentials
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fou706lgs7utdo5a8v013.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fou706lgs7utdo5a8v013.png" alt="AppSync Creds"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Now in the Postman..
&lt;/h2&gt;

&lt;p&gt;Enter the copied URL and API key, and put the query in the Body.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6ls7rrwysxvbm34tqcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6ls7rrwysxvbm34tqcd.png" alt="GraphQL Postman Auth"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8otd8mrektich4q8mk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8otd8mrektich4q8mk0.png" alt="GraphQL Postman Query"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this demo, we have seen how GraphQL fetches data from DynamoDB. Similarly, you can use a different resolver, such as a Lambda function, to handle the request. &lt;/p&gt;

&lt;h2&gt;
  
  
  Securing your AppSync Endpoint..
&lt;/h2&gt;

&lt;p&gt;AWS Web Application Firewall (WAF) is a web security service that helps protect web applications from common web exploits that could affect availability, compromise security, or consume excessive resources.&lt;/p&gt;

&lt;p&gt;To secure an AppSync endpoint with WAF, you can create a WAF rule and associate it with the AppSync endpoint. The rule can be configured to block or allow traffic based on certain criteria, such as IP address or request headers. This can help protect against common web attacks such as SQL injection, cross-site scripting, and others. You can also use WAF to rate-limit requests to the AppSync endpoint to prevent denial-of-service attacks.&lt;/p&gt;
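&lt;p&gt;As a sketch (the rule name and limit are illustrative), a WAF v2 rate-based rule that blocks any single IP exceeding 2,000 requests per five-minute window looks roughly like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
    "Name": "appsync-rate-limit",
    "Priority": 1,
    "Statement": {
        "RateBasedStatement": {
            "Limit": 2000,
            "AggregateKeyType": "IP"
        }
    },
    "Action": { "Block": {} },
    "VisibilityConfig": {
        "SampledRequestsEnabled": true,
        "CloudWatchMetricsEnabled": true,
        "MetricName": "appsync-rate-limit"
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;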

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AWS AppSync is a powerful tool that makes it easy to build, deploy, and manage real-time and offline applications that require data access and manipulation. It supports various use cases from Data Ingestion to pub/sub services.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;✍️ About the Author&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mohamed Fayaz&lt;/strong&gt; is a Data &amp;amp; AI Consultant, and a technical blogger who writes and speaks about the topics such as &lt;strong&gt;Software Engineering&lt;/strong&gt;, &lt;strong&gt;Big Data Analytics&lt;/strong&gt;, and &lt;strong&gt;Cloud Engineering&lt;/strong&gt;. Connect with him on &lt;a href="https://www.linkedin.com/in/mohamedfayazz/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or follow him on &lt;a href="https://twitter.com/0xfayaz" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; for updates.&lt;/p&gt;




</description>
      <category>aws</category>
      <category>graphql</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Data &amp; Analytics Services at AWS re:Invent 2022: A Recap</title>
      <dc:creator>Mohamed Fayaz</dc:creator>
      <pubDate>Tue, 03 Jan 2023 10:16:52 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-reinvent-2022-recap-data-analytics-services-51em</link>
      <guid>https://dev.to/aws-builders/aws-reinvent-2022-recap-data-analytics-services-51em</guid>
      <description>&lt;p&gt;&lt;strong&gt;AWS re:Invent&lt;/strong&gt; is a learning conference hosted for the global cloud computing community with in-person and virtual content where they announced many new features and updates. &lt;/p&gt;

&lt;p&gt;This post summarizes the changes announced for the Data &amp;amp; Analytics services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon DataZone
&lt;/h3&gt;

&lt;p&gt;Amazon DataZone is a tool that helps organizations catalog and share data across the company. It allows data producers (such as data engineers and data scientists) to share data securely and with the right context, and allows data consumers (such as analysts) to find answers to business questions and share them with others in the organization.&lt;/p&gt;

&lt;p&gt;DataZone is intended to provide an easy way to organize and discover data across the organization. It allows users to share, search, and discover data at scale across organizational boundaries through a unified data analytics portal that provides a personalized view of all the data while enforcing governance and compliance policies. &lt;/p&gt;

&lt;p&gt;The tool creates a usage flywheel, where data producers share data securely and with the right context with others in the organization, and data consumers find answers to business questions and share them with others in the organization. This helps improve operational efficiency and enables business and data teams to work with data faster and make informed decisions based on the data. DataZone also aims to remove the burden of governing data and making it accessible to everyone in the organization, giving organizations a competitive edge by turning data into an organizational asset.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Clean Rooms
&lt;/h2&gt;

&lt;p&gt;AWS Clean Rooms is a solution that enables companies to collaborate on shared data sets while still protecting the underlying raw data. This is particularly useful for companies in industries such as financial services, healthcare, and advertising that need to collaborate with partners while also improving data security and protecting underlying data. The traditional methods for leveraging data in collaboration with partners, such as providing copies of data and relying on contractual agreements, can be at odds with protecting data. &lt;/p&gt;

&lt;p&gt;AWS Clean Rooms allows customers to create a secure data clean room in minutes and collaborate with other companies on the AWS Cloud to generate insights about advertising campaigns, investment decisions, and research and development without having to share or reveal raw data. &lt;/p&gt;

&lt;p&gt;Some features and benefits of AWS Clean Rooms include the ability to create a clean room and start collaborating in a few clicks, the ability to collaborate with hundreds of thousands of companies on AWS without sharing or revealing underlying data, the use of privacy-enhancing controls to protect underlying data, and the ability to use easy-to-configure analysis rules to tailor queries to specific business needs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AWS Clean Rooms will be available in early 2023 in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), Europe (London), and Europe (Stockholm).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Amazon OpenSearch Serverless
&lt;/h2&gt;

&lt;p&gt;Amazon OpenSearch Serverless is a new option offered by Amazon OpenSearch Service that simplifies the process of running search and analytics workloads at a large scale without the need to configure, manage, or scale OpenSearch clusters. It automatically provisions and scales the necessary resources to deliver fast data ingestion and query responses for even the most demanding workloads, and users only pay for the resources that are consumed. &lt;/p&gt;

&lt;p&gt;OpenSearch Serverless decouples compute and storage and separates the indexing (ingestion) components from the search (query) components, using Amazon Simple Storage Service (S3) as the primary data storage for indexes. This allows the search and indexing functions to scale independently of each other and of the indexed data in S3. With OpenSearch Serverless, developers can create new collections, which are logical groupings of indexed data that work together to support a workload. &lt;/p&gt;

&lt;p&gt;It also supports the same ingest and query APIs as OpenSearch, making it easy to get started with existing clients and applications, and it can be used to build data visualizations with serverless OpenSearch Dashboards.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Glue Updates
&lt;/h2&gt;

&lt;p&gt;⦿ &lt;strong&gt;AWS Glue 4.0&lt;/strong&gt; – Access to the latest Spark and Python releases so builders can develop, run, and scale their data integration workloads and get insights faster.&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;AWS Glue Data Quality&lt;/strong&gt; - Automatic data quality rule recommendations based on your data&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;AWS Glue for Ray&lt;/strong&gt; - Data integration with Ray (&lt;a href="http://ray.io/" rel="noopener noreferrer"&gt;ray.io&lt;/a&gt;), a popular new open-source compute framework that helps you scale Python workloads&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;AWS Glue for Apache Spark&lt;/strong&gt; - Supports three open source data lake storage frameworks: Apache Hudi, Apache Iceberg, and Linux Foundation Delta Lake.&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;AWS Glue Custom Visual Transform&lt;/strong&gt; - Create and share your own ETL logic, input rules, etc. in Glue Studio, available on the Transform tab. Administrators can share these transforms to improve efficiency for other users.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Redshift Updates
&lt;/h2&gt;

&lt;p&gt;⦿ &lt;strong&gt;Apache Spark Integration&lt;/strong&gt; - Author Apache Spark applications using Java, Python, Scala, with access to rich, curated data in your data warehouse&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;Streaming Ingestion Support&lt;/strong&gt; - Kinesis Data Streams (KDS) and Managed Streaming for Apache Kafka (MSK) without staging in S3&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;Dynamic Data Masking&lt;/strong&gt; - Easily protect sensitive data by managing data masking policies through an SQL interface&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;Auto-Copy From Amazon S3&lt;/strong&gt; - Simple, low code data ingestion&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;New Query sets&lt;/strong&gt; – MERGE, ROLLUP, CUBE, GROUPING SETS&lt;/p&gt;

&lt;p&gt;⦿ Supporting &lt;strong&gt;large JSON objects&lt;/strong&gt; – Up to 16MB (from 1MB)&lt;/p&gt;

&lt;p&gt;⦿ Multi-AZ deployment&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS QuickSight Updates
&lt;/h2&gt;

&lt;p&gt;The new Amazon QuickSight feature that expands API capabilities allows customers to programmatically manage their QuickSight assets (analyses and dashboards) in their DevOps pipeline. Developers can now version-control, back up, and deploy assets programmatically, thereby promoting faster changes that enable innovation in a competitive marketplace. This feature also accelerates migration from legacy BI tools to the cloud, supported by AWS migration partners.&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;Paginated Reports&lt;/strong&gt; - Create, schedule, and share highly formatted multipage reports&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;Q Automated Data Prep&lt;/strong&gt; - AI-enhanced automated data preparation, making it fast and straightforward to augment existing dashboards for natural language questions&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;QuickSight API&lt;/strong&gt; - Access underlying data models of Amazon QuickSight dashboards, reports, analyses and templates via the AWS Software Development Kit (SDK)&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;Two new question types&lt;/strong&gt; – “forecast” and “why”. “forecast” creates a dynamic forecast dashboard, and “why” identifies the data drivers behind a specific data change&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Athena for Spark
&lt;/h2&gt;

&lt;p&gt;Amazon Athena for Apache Spark is a new feature that allows organizations to perform complex data analysis using Apache Spark without the need to configure and manage separate infrastructure. It allows users to build distributed applications using expressive languages like Python, and it offers a simplified notebook experience in the Athena console or through Athena APIs. &lt;/p&gt;

&lt;p&gt;Athena is deeply integrated with other AWS services, making it easy to query data from various sources, chain multiple calculations together, and visualize the results of analyses. The feature enables interactive Spark applications to start in under a second and run faster with an optimized Spark runtime, saving users time and allowing them to focus on insights rather than waiting for results. With Amazon Athena for Apache Spark, there are no servers to manage and no minimum fee or setup cost; users only pay for the queries they run.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Aurora zero-ETL Feature
&lt;/h2&gt;

&lt;p&gt;Amazon Aurora now supports zero-ETL integration with Amazon Redshift, allowing users to perform near real-time analytics and machine learning using Redshift on large amounts of transactional data from Aurora. &lt;br&gt;
With this integration, data is available in Redshift within seconds of being written into Aurora, eliminating the need to build and maintain complex data pipelines for ETL operations.&lt;/p&gt;

&lt;p&gt;The zero-ETL integration also enables users to analyze data from multiple Aurora database clusters in the same Amazon Redshift instance, providing a holistic view of data across multiple applications or partitions. &lt;br&gt;
This allows users to leverage Redshift's analytics and capabilities, such as built-in machine learning, materialized views, data sharing, and federated access to multiple data stores and data lakes, to derive insights from transactional and other data in near real-time.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS LakeFormation Data Sharing Access Control
&lt;/h2&gt;

&lt;p&gt;AWS Lake Formation is a fully managed service that makes it easy to build, secure, and manage a data lake. A data lake is a central repository that allows you to store all your structured and unstructured data at any scale. You can then use the data lake to build a centralized data repository, which can be used for a variety of tasks such as analytics, data warehousing, machine learning, and more.&lt;/p&gt;

&lt;p&gt;The new feature makes it easier for customers to designate the right level of access to various users without having to run complex queries or manually identify who has access to specific data shares. It also improves the security of data by enabling administrators to provide granular, row-level and column-level access to data shares within Lake Formation. This is particularly useful for customers who want to share and work with consistent data across regions and accounts, but want to enforce granular access to different users.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS DocumentDB Elastic Cluster
&lt;/h2&gt;

&lt;p&gt;Amazon DocumentDB (with MongoDB compatibility) offers flexible scaling to store petabytes of data and handle millions of read/write requests per second. The infrastructure is managed by AWS, so there is no need for instance creation or scaling operations. &lt;br&gt;
It also provides high availability across 3 availability zones (AZs) with data replicated to 6 locations across 3 AZs for high durability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma485ocwy6mj6te74qub.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma485ocwy6mj6te74qub.jpeg" alt="AWS Recap by Mohamed Fayaz" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;✍️ About the Author&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mohamed Fayaz&lt;/strong&gt; is a Data &amp;amp; AI Consultant, and a technical blogger who writes and speaks about the topics such as &lt;strong&gt;Software Engineering&lt;/strong&gt;, &lt;strong&gt;Big Data Analytics&lt;/strong&gt;, and &lt;strong&gt;Cloud Engineering&lt;/strong&gt;. Connect with him on &lt;a href="https://www.linkedin.com/in/mohamedfayazz/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or follow him on &lt;a href="https://twitter.com/0xfayaz" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; for updates.&lt;/p&gt;




</description>
      <category>emptystring</category>
    </item>
    <item>
      <title>Effortlessly Manage Your AWS Applications with Just a Few Messages</title>
      <dc:creator>Mohamed Fayaz</dc:creator>
      <pubDate>Wed, 21 Dec 2022 10:38:09 +0000</pubDate>
      <link>https://dev.to/aws-builders/control-your-aws-apps-with-a-simple-message-1595</link>
      <guid>https://dev.to/aws-builders/control-your-aws-apps-with-a-simple-message-1595</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubbueo4mdsd1gd6q8i4t.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubbueo4mdsd1gd6q8i4t.gif" alt="AWS Slack Operations" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Several tools in the market enable engineering teams to easily roll out application changes in various environments, from a simple config update to major code changes. But they often require some level of technical understanding to make those updates.&lt;/p&gt;

&lt;p&gt;Imagine empowering non-technical business individuals to make those updates themselves by sending a message in simple language (not in Python tho 😂), which would save a lot of time on both ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🤨 How to make this happen?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It is possible with a combination of a few tools, and by developing the solutions in such a way that they can accept parameters to control features and behaviours.&lt;/p&gt;

&lt;p&gt;In this post, we will take a look at how it can be done using &lt;strong&gt;Slack&lt;/strong&gt; as our messaging platform, and &lt;strong&gt;AWS AppConfig&lt;/strong&gt; to host and manage the configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  💬 ChatOps
&lt;/h2&gt;

&lt;p&gt;ChatOps is a term used to describe the use of chat platforms, such as &lt;strong&gt;Slack&lt;/strong&gt; or &lt;strong&gt;Microsoft Teams&lt;/strong&gt;, for communicating about and automating tasks related to software development, infrastructure, and operations. The goal of ChatOps is to allow team members to collaborate and work more efficiently by providing a central location for communication, task execution, and status updates.&lt;/p&gt;

&lt;p&gt;In ChatOps, team members can use chat commands to trigger automated tasks and processes, as well as view the status and output of those tasks in the chat platform. This allows team members to communicate and collaborate in real-time, without having to switch between multiple tools or applications.&lt;/p&gt;

&lt;p&gt;For example, a team member could use a chat command to deploy code to a production environment, view the status of a continuous integration (CI) pipeline, or get the latest logs from a server. This can help teams to streamline their workflow and reduce the time and effort required to complete tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚙️ AppConfig
&lt;/h2&gt;

&lt;p&gt;AWS AppConfig is a fully managed service that enables developers to quickly and easily roll out new features and updates to their applications. It allows you to create and manage application configurations, and deploy those configurations to specific environments or groups of users.&lt;/p&gt;

&lt;p&gt;One of the main benefits of using AppConfig is that it helps you to deploy new features and updates to your applications more quickly and safely. With AppConfig, you can create and test new configurations in a staging environment before rolling them out to your production environment. This helps to reduce the risk of downtime or other issues that can arise when deploying updates to your application.&lt;/p&gt;

&lt;p&gt;AppConfig also makes it easy to manage and track the deployment of your configurations. It provides many tools for monitoring the status of your deployments, including real-time alerts and detailed metrics. This allows you to quickly identify and resolve any issues that may arise during the deployment process.&lt;/p&gt;

&lt;p&gt;In addition to its deployment capabilities, AppConfig also offers several other features that make it a powerful tool for managing application configurations. These features include:&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;Validation&lt;/strong&gt;: AppConfig allows you to validate configurations before deployment, for example by verifying data types and formats using JSON Schema or Lambda function validators.&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;Versioning&lt;/strong&gt;: AppConfig provides built-in versioning capabilities, allowing you to track changes to your configurations and roll back to previous versions if needed.&lt;/p&gt;

&lt;p&gt;⦿ &lt;strong&gt;Integration with other AWS services&lt;/strong&gt;: AppConfig integrates with many AWS services, including Amazon CloudWatch and AWS CloudFormation, making it easier to manage and deploy your configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Let's dive into the design&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzasqc70fuc4z76zeiqve.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzasqc70fuc4z76zeiqve.jpg" alt="AWS Slack Integration" width="691" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on the diagram above, the user sends a message in the Slack channel such as&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/app-update {"background-colour": "black"}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/app-update deploy-to-prod&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/edh-generate-report today&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Any message that starts with &lt;code&gt;/&lt;/code&gt; invokes a call to the endpoint configured in Slack; in this case, we are calling an AWS Lambda Function URL to process the message. The function can then update configurations in AppConfig, trigger the pipeline for a production deployment, or kick off any other workflow. &lt;/p&gt;
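&lt;p&gt;As a rough sketch (not the exact code behind the demo), the Lambda behind the Function URL could parse the slash command like this. Slack delivers slash commands as a form-encoded body with &lt;code&gt;command&lt;/code&gt; and &lt;code&gt;text&lt;/code&gt; fields; the AppConfig IDs in the comment are placeholders.&lt;/p&gt;

```python
import json
from urllib.parse import parse_qs

def parse_slash_command(body: str) -> dict:
    """Extract the command and its payload from Slack's form-encoded body."""
    fields = parse_qs(body)
    command = fields["command"][0]        # e.g. "/app-update"
    text = fields.get("text", [""])[0]    # everything typed after the command
    try:
        payload = json.loads(text)        # structured update, e.g. {"background-colour": "black"}
    except json.JSONDecodeError:
        payload = text                    # plain keyword, e.g. "deploy-to-prod"
    return {"command": command, "payload": payload}

def handler(event, context):
    parsed = parse_slash_command(event["body"])
    # From here the function could update AppConfig (IDs below are placeholders):
    # boto3.client("appconfig").create_hosted_configuration_version(
    #     ApplicationId="app-id", ConfigurationProfileId="profile-id",
    #     Content=json.dumps(parsed["payload"]).encode(),
    #     ContentType="application/json")
    return {"statusCode": 200, "body": f"Received {parsed['command']}"}
```

&lt;p&gt;Slack expects a response within a few seconds, so any long-running work should be handed off (for example to another Lambda or a queue) before replying.&lt;/p&gt;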

&lt;h2&gt;
  
  
  &lt;strong&gt;🎥 Demo&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3pbuyjgcnw765pt9eij8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3pbuyjgcnw765pt9eij8.gif" alt="AWS AppConfig and Slack Demo" width="600" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;👍 Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this demo, we invoked the AWS Lambda Function URL with a payload, and it returned the response to us as a message. &lt;/p&gt;

&lt;p&gt;AppConfig helps to streamline the process of updating your applications and ensures that you can deliver new features and functionality to your users in a timely and reliable manner.&lt;/p&gt;

&lt;p&gt;One alternative is to build a web-based portal where individuals can edit these configurations themselves. But then the questions arise: do we need a separate portal for each application, and who manages user access controls, deployments and so on? Investing in a tool like AppConfig and building a culture around it helps the organization stay effective and efficient.&lt;/p&gt;


&lt;br&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;✍️ About the Author&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mohamed Fayaz&lt;/strong&gt; is a Data &amp;amp; AI Consultant and a technical blogger who writes and speaks about topics such as &lt;strong&gt;Software Engineering&lt;/strong&gt;, &lt;strong&gt;Big Data Analytics&lt;/strong&gt;, and &lt;strong&gt;Cloud Engineering&lt;/strong&gt;. Connect with him on &lt;a href="https://www.linkedin.com/in/mohamedfayazz/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or follow him on &lt;a href="https://twitter.com/0xfayaz" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; for updates.&lt;/p&gt;




</description>
    </item>
    <item>
      <title>👨‍💻 DynamoDB for every developer</title>
      <dc:creator>Mohamed Fayaz</dc:creator>
      <pubDate>Mon, 20 Dec 2021 09:21:20 +0000</pubDate>
      <link>https://dev.to/aws-builders/dynamodb-for-every-developers-3n6n</link>
      <guid>https://dev.to/aws-builders/dynamodb-for-every-developers-3n6n</guid>
      <description>&lt;h4&gt;
  
  
Understanding databases is important for every developer, and emerging cloud services bring best-in-class tools and technologies that let any developer build applications in hours.
&lt;/h4&gt;

&lt;h4&gt;
  
  
A database is a critical component of any application. NoSQL databases have become a widely popular choice for solutions ranging from web apps to machine learning applications.
&lt;/h4&gt;

&lt;h4&gt;
  
  
Let's take a look at one of the most popular cloud database services: DynamoDB.
&lt;/h4&gt;

&lt;h2&gt;
  
  
  ❓What is DynamoDB?
&lt;/h2&gt;

&lt;p&gt;DynamoDB is a fully managed NoSQL database from Amazon Web Services (AWS). It is a key-value and document database, which means data is stored as items made up of key/value attribute pairs. DynamoDB stores data across three different Availability Zones in a region. &lt;/p&gt;

&lt;p&gt;Since it is a fully managed service, DynamoDB suits a wide range of workloads with any amount of data, auto-scaling from small to large and anywhere in between. &lt;/p&gt;

&lt;h2&gt;
  
  
  📌 Key Benefits of using DynamoDB
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;⚡ Elasticity and Performance &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Generally, scaling databases is hard and risky, but DynamoDB can auto-scale based on usage, allowing your system to adjust to traffic and other needs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🦾 Security &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Access control policies and encryption are available for your tables, making the database more secure and accessible to the appropriate users without any bottlenecks. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; 🏁 Real-time event streaming &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DynamoDB Streams let developers react to data changes as they happen, which makes DynamoDB suitable for many complex, event-driven solutions. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; 🧰 Database Management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DynamoDB creates and keeps backups of your data safely in the cloud, so in the event of a failure you can safely restore the data. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; ⏲️ Time to Live&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TTL allows you to define when items in a table expire so they are deleted from the database automatically. &lt;/p&gt;
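&lt;p&gt;DynamoDB's TTL feature expects a numeric attribute holding the expiry time as a Unix epoch timestamp in seconds. A small sketch (the table and attribute names here are only illustrative):&lt;/p&gt;

```python
import time

def ttl_epoch(seconds_from_now: int) -> int:
    """Return the epoch timestamp (in seconds) that DynamoDB's TTL attribute expects.
    Items are removed some time after this moment passes."""
    return int(time.time()) + seconds_from_now

# Example item with a 7-day expiry; "expires_at" would be configured
# as the table's TTL attribute.
item = {
    "session_id": {"S": "abc-123"},
    "expires_at": {"N": str(ttl_epoch(7 * 24 * 3600))},
}
```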

&lt;h3&gt;
  
  
  ✅ Considerations
&lt;/h3&gt;

&lt;p&gt;Some important guidelines for assessing whether DynamoDB suits your use case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DynamoDB is well suited for business-critical applications that require high scalability and is predominantly used for transaction processing (OLTP).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DynamoDB is not suitable for data warehousing and OLAP purposes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ⭐ &lt;strong&gt;Read and Write Consistency&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DynamoDB supports two types of reads: &lt;strong&gt;Eventually Consistent Reads&lt;/strong&gt; and &lt;strong&gt;Strongly Consistent Reads&lt;/strong&gt;. By default, reads are eventually consistent. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With eventually consistent reads, data may be slightly stale, as the copies generally become consistent within a second. A strongly consistent read, by contrast, is always up-to-date since it fetches the data straight from the leader node. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DynamoDB does not support strongly consistent reads across AWS regions. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🌋Throughputs and Capacity Management
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DynamoDB offers two capacity modes, &lt;strong&gt;Provisioned&lt;/strong&gt; and &lt;strong&gt;On-Demand&lt;/strong&gt;. In the Provisioned mode, the developer would need to define the read and write capacity of the database. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On-demand mode is a flexible way to provision your DynamoDB tables to serve thousands of requests without doing any capacity planning.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Switching between provisioned and on-demand modes can be done once every 24 hours.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  👨‍🔬 Calculating Read and Write Capacity Units
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Read Capacity Units (RCU)
&lt;/h3&gt;

&lt;p&gt;One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second, for an item of up to 4 KB.&lt;/p&gt;

&lt;p&gt;Let's see how we calculate RCU for strongly consistent read: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 RCU = 1 strongly consistent read per second of an item up to 4 KB. So let's take an example of 50 reads per second at 40 KB: &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;= 50 X 40/4 = 500 RCUs &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For eventually consistent reads, one RCU covers two reads of up to 4 KB. For the same 50 reads at 40 KB:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;= 50/2 X 40/4 = 250  RCUs&lt;/p&gt;

&lt;h3&gt;
  
  
  Write Capacity Units (WCU)
&lt;/h3&gt;

&lt;p&gt;One write capacity unit represents one write per second of an item up to 1 KB, so, for example, if we need 50 writes per second of 40 KB items:&lt;/p&gt;

&lt;p&gt;= 50 X 40 = 2000 WCUs&lt;/p&gt;
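&lt;p&gt;The arithmetic above can be wrapped in a small helper. This is just a sketch of the stated rules: item sizes round up to the next 4 KB for reads and the next 1 KB for writes.&lt;/p&gt;

```python
import math

def rcu(reads_per_sec: int, item_kb: float, strongly_consistent: bool = True) -> int:
    """RCUs needed: one RCU covers one strongly consistent read per second
    (or two eventually consistent reads) of an item up to 4 KB."""
    units = reads_per_sec * math.ceil(item_kb / 4)
    return units if strongly_consistent else math.ceil(units / 2)

def wcu(writes_per_sec: int, item_kb: float) -> int:
    """WCUs needed: one WCU covers one write per second of an item up to 1 KB."""
    return writes_per_sec * math.ceil(item_kb)

print(rcu(50, 40))         # 500 RCUs, matching the example above
print(rcu(50, 40, False))  # 250 RCUs
print(wcu(50, 40))         # 2000 WCUs
```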

&lt;h2&gt;
  
  
  📑 Partitions
&lt;/h2&gt;

&lt;p&gt;Partitions in DynamoDB slice the table into multiple chunks of data for faster reads. DynamoDB automatically creates a new partition for every 10 GB of data, or when your application exceeds 3000 RCUs or 1000 WCUs on a single partition. Generally, DynamoDB tries to distribute the RCU and WCU capacity evenly across partitions. &lt;/p&gt;

&lt;h2&gt;
  
  
  🔑 Primary Key (HASH) and Sort Key (RANGE)
&lt;/h2&gt;

&lt;p&gt;The primary key determines the partition to store the data, so it is important to get the primary key correct to be able to distribute the data across the partitions for better performance. &lt;/p&gt;

&lt;p&gt;A DynamoDB table can have two kinds of keys: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary Key&lt;/li&gt;
&lt;li&gt;Sort Key &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;Primary Key&lt;/strong&gt; can be either simple or composite. A composite key consists of both a partition key and a sort key. &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Sort Key&lt;/strong&gt; helps to organize the data in DynamoDB: sort keys logically group related items together for efficient querying. &lt;/p&gt;
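&lt;p&gt;For illustration, here is the shape of a composite-key table definition as you would pass it to boto3's &lt;code&gt;create_table&lt;/code&gt; (the table and attribute names are hypothetical):&lt;/p&gt;

```python
# You would pass these parameters to boto3.client("dynamodb").create_table(**params).
# "Orders", "customer_id" and "order_date" are illustrative names only.
params = {
    "TableName": "Orders",
    "KeySchema": [
        {"AttributeName": "customer_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "order_date", "KeyType": "RANGE"},   # sort key
    ],
    "AttributeDefinitions": [
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
    ],
    "BillingMode": "PAY_PER_REQUEST",  # on-demand capacity mode
}
```

&lt;p&gt;All orders for one customer then live in the same partition, sorted by date, so a Query on &lt;code&gt;customer_id&lt;/code&gt; returns them efficiently.&lt;/p&gt;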

&lt;h2&gt;
  
  
  🤖 Secondary Indexes
&lt;/h2&gt;

&lt;p&gt;DynamoDB allows your application to query the table using an alternate key, in addition to queries against the primary key. The two types of indexes are as follows: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Global Secondary Index (GSI) &lt;br&gt;
An index whose partition key and sort key can differ from the table's. It only supports eventually consistent reads, and it can be created or modified at any time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Local Secondary Index (LSI) &lt;br&gt;
An index that has the same partition key as the table but a different sort key. It supports strongly or eventually consistent reads, and it can be created only at initial table creation. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚀 DAX
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DAX can reduce the response times to microseconds. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DAX supports server-side encryption only. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DAX is not ideal for write-intensive workloads. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ⛑️ Backups and Monitoring
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DynamoDB provides on-demand backup capability as well as point-in-time recovery for your DynamoDB tables. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Backups include the data, indexes, streams and provisioned RCU and WCU capacity. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DynamoDB usage can be monitored using CloudWatch and CloudTrail. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Building a Twitter Bot with AWS Serverless</title>
      <dc:creator>Mohamed Fayaz</dc:creator>
      <pubDate>Tue, 03 Aug 2021 05:20:01 +0000</pubDate>
      <link>https://dev.to/aws-builders/running-twitter-bot-on-aws-serverless-mohamed-fayaz-b9</link>
      <guid>https://dev.to/aws-builders/running-twitter-bot-on-aws-serverless-mohamed-fayaz-b9</guid>
      <description>&lt;p&gt;Sharing my experience of how I am running the Twitter Bot on AWS for almost free using AWS serverless technologies. &lt;/p&gt;

&lt;p&gt;A social bot is simply a program that mimics the behaviour of a user based on rules you set. Twitter is one of the most bot-friendly platforms: you can automate liking, reposting, following, unfollowing and messaging using its API. &lt;/p&gt;

&lt;p&gt;Before we get down to the nitty-gritty of how it works, there are &lt;a href="https://help.twitter.com/en/rules-and-policies/twitter-automation" rel="noopener noreferrer"&gt;official guidelines&lt;/a&gt; from Twitter to follow whenever we create a bot.  &lt;/p&gt;

&lt;h2&gt;
  
  
  My Bot account
&lt;/h2&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1409561715813982212-537" src="https://platform.twitter.com/embed/Tweet.html?id=1409561715813982212"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;I created this account in June 2020, yet it has outperformed my other account, which I have been using for 10+ years, in followers, tweets and engagement. The same content posted from this account reaches a much larger audience than from the original one. &lt;/p&gt;

&lt;p&gt;To create your own bot, all you need is a Twitter account and an AWS account to host your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcj6j0tegx9uso6sr06xf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcj6j0tegx9uso6sr06xf.jpg" alt="TwitterBOT"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Lambda
&lt;/h3&gt;

&lt;p&gt;Lambda is a serverless compute service from AWS that lets you write and run your own code without managing any servers. Lambda's free tier includes 1 million requests and 400,000 GB-seconds of compute time per month. It also supports multiple languages, such as Node.js, Python, Java and many more. I used Python to develop this bot, which runs multiple times a day to source the best content based on keywords and hashtags. &lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Secrets Manager
&lt;/h3&gt;

&lt;p&gt;Secrets Manager helps to keep our secret keys secure within AWS. The cost is based on the number of secrets, so it may cost you around $1 a month to store your Twitter access keys securely.&lt;/p&gt;

&lt;p&gt;Alternatively, you can keep the keys in Lambda environment variables, which won't cost you anything; but as a best practice, never store your keys directly in the code.    &lt;/p&gt;
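&lt;p&gt;A minimal sketch of that cheaper approach, reading the keys from Lambda environment variables (the variable names here are my own choice, and the commented line shows the equivalent Secrets Manager call):&lt;/p&gt;

```python
import os

KEY_NAMES = (
    "TWITTER_API_KEY", "TWITTER_API_SECRET",
    "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET",
)

def get_twitter_keys() -> dict:
    """Read the bot's credentials from environment variables.
    With Secrets Manager you would instead call something like:
    boto3.client("secretsmanager").get_secret_value(SecretId="twitter-bot-keys")
    """
    missing = [name for name in KEY_NAMES if name not in os.environ]
    if missing:
        raise RuntimeError(f"missing credentials: {missing}")
    return {name: os.environ[name] for name in KEY_NAMES}
```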

&lt;h3&gt;
  
  
  AWS DynamoDB
&lt;/h3&gt;

&lt;p&gt;DynamoDB is a NoSQL database from AWS that stores data in key-value pairs. I am using it to store metadata about the content I have shared on Twitter, to avoid repetitive content in my feed. Since I am using on-demand mode, the cost is based on reads and writes; in my case it costs less than $1 per month. &lt;/p&gt;
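&lt;p&gt;One way such a metadata check can work (a sketch of the idea, not my exact scheme) is to key each shared item by a hash of its normalized text, so the bot can look the key up in DynamoDB before reposting:&lt;/p&gt;

```python
import hashlib

def dedup_key(tweet_text: str) -> str:
    """Normalize the text and hash it, so the same content with different
    spacing or casing maps to the same DynamoDB partition key."""
    normalized = " ".join(tweet_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Before posting: if a GetItem on dedup_key(text) finds an item, skip the post;
# otherwise post it and PutItem the key (with a TTL so old entries age out).
```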

&lt;h3&gt;
  
  
  AWS CloudWatch
&lt;/h3&gt;

&lt;p&gt;CloudWatch is a monitoring and observability service that provides the data needed to derive actionable insights about our application. I write all the Lambda logs to CloudWatch to keep track of how the bot is progressing. This fits within the free tier, so there is no additional cost to keep these logs. &lt;/p&gt;

&lt;h3&gt;
  
  
  Slack
&lt;/h3&gt;

&lt;p&gt;Lastly, I am using Slack to monitor the content the bot shares. You may wonder why we can't just check the Twitter app now and then; having the updates arrive as Slack messages is far more convenient than scanning my Twitter feed. &lt;/p&gt;

&lt;h2&gt;
  
  
  Final words
&lt;/h2&gt;

&lt;p&gt;Serverless architectures provide great benefits, chiefly scalability and low cost. There are paid services to manage your Twitter content, but running your own is nearly free on a platform like AWS. Similarly, there are several other applications and use cases you can build with serverless technologies. &lt;/p&gt;

</description>
      <category>twitter</category>
      <category>python</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS Managed Airflow for your complex workflows</title>
      <dc:creator>Mohamed Fayaz</dc:creator>
      <pubDate>Sun, 18 Jul 2021 01:43:38 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-managed-airflow-for-your-complex-workflows-16d</link>
      <guid>https://dev.to/aws-builders/aws-managed-airflow-for-your-complex-workflows-16d</guid>
      <description>&lt;p&gt;&lt;strong&gt;Airflow&lt;/strong&gt; is a tool to orchestrate complex workflow which was created at Airbnb in 2014. Airflow gained significant traction across several organizations in recent days due to the ability to create complex data pipelines with ease. The Airflow user interface (UI) serves as an operational dashboard to schedule, monitor and control any scripts or applications. &lt;/p&gt;

&lt;p&gt;Although Airflow has been adopted by many organizations, deploying and managing the infrastructure has always been challenging and introduces operational overhead. To combat this problem, cloud providers introduced managed Airflow offerings such as Amazon Managed Workflows for Apache Airflow (MWAA). AWS takes responsibility for keeping your infrastructure running with almost no downtime and with up-to-date security patches in place, along with recent versions readily available to use. &lt;/p&gt;

&lt;p&gt;Besides, the other key benefits of AWS MWAA are its elasticity, the ability to scale up and down based on the workload, and the ease of building and deploying production-grade, secure infrastructure with seamless integration with other AWS services. &lt;/p&gt;

&lt;h2&gt;
  
  
  Airflow Basics
&lt;/h2&gt;

&lt;p&gt;Let's dive into some key concepts of Airflow :)&lt;/p&gt;

&lt;p&gt;In Airflow, a workflow is defined in a Python file, also referred to as a DAG. You can think of a DAG as a single job that can contain multiple tasks. Every DAG has three common parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DAG initialization &lt;/li&gt;
&lt;li&gt;Tasks &lt;/li&gt;
&lt;li&gt;Task dependencies &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  DAG (Directed Acyclic Graphs)
&lt;/h3&gt;

&lt;p&gt;DAGs are written in Python and are identified by their unique &lt;code&gt;dag_id&lt;/code&gt;. During initialization, we specify the start date, the schedule interval and so forth. Here is a simple DAG below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;days_ago&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_dag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;days_ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
          &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sample DAG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Task
&lt;/h3&gt;

&lt;p&gt;Tasks perform actions ranging from executing a piece of shell script to triggering EMR jobs. A DAG must exist before we create any task. Every task in a DAG is defined by an operator and, similar to &lt;code&gt;dag_id&lt;/code&gt;, the &lt;code&gt;task_id&lt;/code&gt; needs to be unique within the DAG.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def function_a (**kwargs):
        name = kwargs['name']
        return f'hello {name} !!'

first_task = PythonOperator(
        task_id="first_task", 
        python_callable= function_a,
        op_kwargs= {'name': 'Fayaz'}, 
        dag= dag)

second_task = DummyOperator(task_id="second_task", dag=dag)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Task Dependencies
&lt;/h3&gt;

&lt;p&gt;The last part of the DAG is creating dependencies among the tasks. In this case, we trigger &lt;code&gt;first_task&lt;/code&gt; first, and then &lt;code&gt;second_task&lt;/code&gt; as soon as &lt;code&gt;first_task&lt;/code&gt; completes. It looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;first_task &amp;gt;&amp;gt; second_task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we understand what Airflow is and how to create a simple DAG, let's spin up AWS MWAA to run it. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;You need to use your AWS Account to perform the next few steps which may incur some charges.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Setting up the Managed Airflow Instance in AWS
&lt;/h2&gt;

&lt;p&gt;Before we create a new MWAA environment, we need to create an &lt;code&gt;S3 bucket&lt;/code&gt;, which must have versioning enabled. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; Go to &lt;a href="https://console.aws.amazon.com/mwaa/home?#create/environment" rel="noopener noreferrer"&gt;Managed Airflow Console&lt;/a&gt; and click Create Environment&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fbj7j107m94ou85l42c.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fbj7j107m94ou85l42c.PNG" alt="2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Enter a &lt;strong&gt;name&lt;/strong&gt; and choose the Airflow version as &lt;strong&gt;2.0.2&lt;/strong&gt; (the latest at the time of writing)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4pu7o86x8vpu1v8pv7j.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4pu7o86x8vpu1v8pv7j.PNG" alt="3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Choose the S3 bucket - the one you have created &lt;br&gt;
&lt;strong&gt;Step 4.&lt;/strong&gt; For Dags folder type &lt;strong&gt;s3://{your-bucket-name}/dags&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Step 5.&lt;/strong&gt; Click &lt;strong&gt;Next&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76re3ydkbtxsudtf2a64.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76re3ydkbtxsudtf2a64.PNG" alt="5"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6.&lt;/strong&gt; Click on &lt;strong&gt;Create MWAA VPC&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fykk0gfszywk1h6ff5k1k.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fykk0gfszywk1h6ff5k1k.PNG" alt="6"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7.&lt;/strong&gt; It will take you to a page with the VPC and subnet details. Click &lt;strong&gt;Create Stack&lt;/strong&gt;, which may take a few minutes to complete&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzmx0kyzz6vtuk3uadrn.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzmx0kyzz6vtuk3uadrn.PNG" alt="7"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8.&lt;/strong&gt; Choose the VPC you just created and scroll down to enter environment class and other configurations&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dkbp4p132qou378blun.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dkbp4p132qou378blun.PNG" alt="8"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 9.&lt;/strong&gt;  Choose &lt;strong&gt;Create a new role&lt;/strong&gt; and click Next&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0n7owa5xiwjdx17n45gm.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0n7owa5xiwjdx17n45gm.PNG" alt="9"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 10.&lt;/strong&gt; Verify all the details and click &lt;strong&gt;Create environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme0syi4hg0zj5oq7ki0n.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme0syi4hg0zj5oq7ki0n.PNG" alt="10"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generally, it takes 10-20 mins to spin up the Airflow Infrastructure so this is the time to get your coffee ☕ before we deploy our very first DAG 🤣😂 &lt;/p&gt;

&lt;p&gt;🎉🎉🎉 When you refresh in a few minutes, you will see the environment status as &lt;code&gt;Available&lt;/code&gt; so click &lt;strong&gt;Open Airflow UI&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiz8gcisbyxxd99qqtlkg.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiz8gcisbyxxd99qqtlkg.PNG" alt="11"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kuual4eq4zd8h990kwm.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kuual4eq4zd8h990kwm.PNG" alt="12"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yay!! Now we got our environment up and ready to go. So let's deploy our first DAG. &lt;/p&gt;
&lt;h2&gt;
  
  
  Deploying DAGs in the AWS MWAA
&lt;/h2&gt;

&lt;p&gt;Step 1: To deploy the DAG, we need to copy a &lt;code&gt;.py&lt;/code&gt; file to the &lt;code&gt;dags/&lt;/code&gt; folder in our S3 bucket. Copy the code below into a file and save it locally as &lt;code&gt;demo_dag.py&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""
Importing necessary modules
"""
from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


"""
Initializing DAGs
"""
dag = DAG(
          dag_id="grepy_sample_dag",
          start_date=days_ago(2), 
          description="DAG which orchestrates a simple ML workflow",
          schedule_interval='@daily')

"""
Creating Tasks
"""
def function_a (**kwargs):
        name = kwargs['name']
        return f'hello {name} !!'

first_task = PythonOperator(
        task_id="first_task", 
        python_callable= function_a,
        op_kwargs= {'name': 'Fayaz'}, 
        dag= dag)

second_task = DummyOperator(task_id="second_task", dag=dag)

"""
Dependencies
"""
first_task &amp;gt;&amp;gt; second_task

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
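&lt;p&gt;Before uploading, you can sanity-check the callable on its own, outside Airflow. The &lt;code&gt;PythonOperator&lt;/code&gt; passes &lt;code&gt;op_kwargs&lt;/code&gt; to the callable as keyword arguments, so a plain Python call with the same keyword argument shows what &lt;code&gt;first_task&lt;/code&gt; will return and log:&lt;/p&gt;

```python
# The same callable as in demo_dag.py -- no Airflow needed to test it.
def function_a(**kwargs):
    name = kwargs["name"]
    return f"hello {name} !!"


# Simulate what PythonOperator does: pass op_kwargs as keyword arguments.
result = function_a(name="Fayaz")
print(result)  # hello Fayaz !!
```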



&lt;p&gt;Step 2: Upload the &lt;code&gt;demo_dag.py&lt;/code&gt; file to the &lt;code&gt;dags/&lt;/code&gt; folder in your S3 bucket. &lt;/p&gt;
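&lt;p&gt;If you prefer to script this step instead of using the console, the upload can be sketched with boto3. The bucket name below is a placeholder; substitute the bucket configured for your MWAA environment:&lt;/p&gt;

```python
# Minimal upload sketch. MWAA_BUCKET is a hypothetical name -- replace it
# with the S3 bucket attached to your MWAA environment.
from pathlib import Path

MWAA_BUCKET = "my-mwaa-environment-bucket"


def dag_s3_key(local_path):
    """Build the S3 key under the dags/ prefix that MWAA watches."""
    return f"dags/{Path(local_path).name}"


def upload_dag(local_path, bucket=MWAA_BUCKET):
    """Copy a local DAG file into the environment's dags/ folder on S3."""
    import boto3  # imported lazily so dag_s3_key works without AWS deps

    boto3.client("s3").upload_file(local_path, bucket, dag_s3_key(local_path))
```

&lt;p&gt;Calling &lt;code&gt;upload_dag("demo_dag.py")&lt;/code&gt; with valid AWS credentials mirrors the console upload above.&lt;/p&gt;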

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrq0xbza1gw42khorhk0.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrq0xbza1gw42khorhk0.PNG" alt="13"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 3: That's it!! It may take a few minutes for the DAG to appear when you deploy it for the first time, but it will look like this: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1w73kw017364fr44n2e.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1w73kw017364fr44n2e.PNG" alt="14"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you toggle the DAG on for the first time, it will be triggered automatically. Click the DAG name to open the tree view, where you can see the run status and task dependencies. &lt;/p&gt;

&lt;p&gt;One of the best parts is that you can view all the task logs from the Graph view. The same logs are also available in CloudWatch, so you can ingest them into Splunk or ELK for further analysis. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuw34qgavjfd1rcfwl4g.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuw34qgavjfd1rcfwl4g.PNG" alt="15"&gt;&lt;/a&gt;&lt;/p&gt;
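&lt;p&gt;At the time of writing, MWAA writes each component's logs to its own CloudWatch log group following a predictable naming convention, which makes it straightforward to point your log-ingestion tooling at them or to query them with boto3. A small sketch (the environment name is whatever you chose when creating the environment):&lt;/p&gt;

```python
# MWAA log groups follow (at the time of writing) the convention
#   airflow-ENVIRONMENT_NAME-COMPONENT
# where COMPONENT is one of: Task, Scheduler, WebServer, Worker, DAGProcessing.
def mwaa_log_group(env_name, component="Task"):
    return f"airflow-{env_name}-{component}"


def fetch_recent_task_logs(env_name, limit=50):
    """Pull recent task-log events from CloudWatch (requires AWS credentials)."""
    import boto3  # lazy import so mwaa_log_group stays usable on its own

    logs = boto3.client("logs")
    resp = logs.filter_log_events(logGroupName=mwaa_log_group(env_name), limit=limit)
    return [event["message"] for event in resp["events"]]
```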

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we took a high-level look at Airflow basics, dove into AWS Managed Workflows for Apache Airflow, and deployed a sample DAG that runs a simple Python function. In the same way, you can orchestrate any type of task using the various operators available on any Airflow installation.&lt;/p&gt;

&lt;h2&gt;
  
  
  For further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt; &lt;a href="https://aws.amazon.com/managed-workflows-for-apache-airflow/getting-started/" rel="noopener noreferrer"&gt;https://aws.amazon.com/managed-workflows-for-apache-airflow/getting-started/&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thank you for your time and happy learning !! 😊 &lt;/p&gt;

</description>
      <category>aws</category>
      <category>airflow</category>
      <category>python</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
