<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Christopher Thompson H.</title>
    <description>The latest articles on DEV Community by Christopher Thompson H. (@cthompsonh).</description>
    <link>https://dev.to/cthompsonh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F613426%2Fc30ccd3f-20a8-48cf-a7da-3f9c9bafecb0.png</url>
      <title>DEV Community: Christopher Thompson H.</title>
      <link>https://dev.to/cthompsonh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cthompsonh"/>
    <language>en</language>
    <item>
      <title>How to combat climate change with data in AWS</title>
      <dc:creator>Christopher Thompson H.</dc:creator>
      <pubDate>Mon, 20 Dec 2021 19:43:43 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-to-combat-climate-change-with-data-in-aws-52nf</link>
      <guid>https://dev.to/aws-builders/how-to-combat-climate-change-with-data-in-aws-52nf</guid>
      <description>&lt;p&gt;Hello data Lovers! this blog will talk about some initiatives driven by AWS technologies that allow us to analyze and prevent some of the most significant effects of climate change in the world. Each section has its respective source to learn about the initiatives directly, &lt;strong&gt;and it's only a compilation of what already exists on the web.&lt;/strong&gt; 😄&lt;/p&gt;

&lt;p&gt;Data and its analysis are increasingly crucial to the urgent task of measuring, modeling, and monitoring global climate change. Multilateral organizations, governments, non-governmental organizations, and companies worldwide are committed to compiling and generating databases that support the fight against global warming.&lt;/p&gt;

&lt;p&gt;At the same time, researchers are increasingly adopting tools with higher availability and accessibility built on cloud computing, such as advanced analytics services that accelerate real-time monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Digital information for decision-making in Africa
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Digital Earth Africa&lt;/strong&gt; is a program that promotes access to Earth observation data, giving African countries satellite information about floods, droughts, soil and coastal erosion, agriculture, land cover, forests, and land use, among other services. Users can analyze critical data within minutes of it becoming available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjektpkc8nhsn0f6e601o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjektpkc8nhsn0f6e601o.jpg" alt="digital-earth-africa"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks to the &lt;a href="https://sustainability.aboutamazon.com/environment/the-cloud/asdi" rel="noopener noreferrer"&gt;Amazon Sustainability Data Initiative&lt;/a&gt; (ASDI), this information is supported and endorsed by the various entities within the program.&lt;/p&gt;

&lt;p&gt;The success story published by AWS can be found in the following blog:&lt;br&gt;
&lt;a href="https://aws.amazon.com/es/blogs/publicsector/digital-earth-africa-enabling-insights-for-better-decision-making/" rel="noopener noreferrer"&gt;https://aws.amazon.com/es/blogs/publicsector/digital-earth-africa-enabling-insights-for-better-decision-making/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Melting of Peruvian glaciers in real time
&lt;/h2&gt;

&lt;p&gt;Peru holds approximately 68% of the world's tropical glacier mass, which has shrunk to less than half of what it was 40 years ago. The Instituto Nacional de Investigación en Glaciares y Ecosistemas de Montaña (Inaigem), a state-run agency, uses machine learning and artificial intelligence tools to analyze data collected in real time from the most vulnerable glacial lakes, calculating the probability of avalanches, shortening response times, and issuing alerts to prevent accidents and harm to the population.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl07z947jviv26ou9qxkp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl07z947jviv26ou9qxkp.jpg" alt="chacraraju-mountain"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks to AWS technologies, sensor readings can be collected in near real time (within seconds), consolidated in a central repository and data lake on AWS, and turned into alerts through messaging services when avalanches or landslides are likely.&lt;/p&gt;
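
&lt;p&gt;To make the alerting piece concrete, here is a minimal sketch of publishing such an alert with Amazon SNS from the AWS CLI. The topic ARN, subject, and message are hypothetical placeholders, not details of the Inaigem system:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical topic; subscribers (SMS, email, Lambda, ...) receive the alert
aws sns publish \
  --topic-arn arn:aws:sns:us-east-1:123456789012:glacier-lake-alerts \
  --subject "Possible avalanche risk detected" \
  --message "Sensor threshold exceeded at glacial lake station 7"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;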

&lt;p&gt;If you want to find out more, these are the links:&lt;br&gt;
&lt;a href="https://tc.copernicus.org/articles/13/2537/2019/" rel="noopener noreferrer"&gt;https://tc.copernicus.org/articles/13/2537/2019/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.aboutamazon.com/news/aws/tracking-the-disappearing-glaciers-of-peru" rel="noopener noreferrer"&gt;https://www.aboutamazon.com/news/aws/tracking-the-disappearing-glaciers-of-peru&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Shark and sea state monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kfkvname06rsagv53eq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kfkvname06rsagv53eq.jpg" alt="shark-monitoring-with-aws"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;About 95% of the ocean remains unexplored, and this lack of data hampers conservation efforts. The non-governmental organization OCEARCH was created to help scientists access previously unavailable information: it uses cloud computing to store and share satellite telemetry on shark movements through the OCEARCH Shark Tracker application on its site. This information allows more than 180 scientists from 90 organizations to make progress on 23 different research projects.&lt;/p&gt;

&lt;p&gt;OCEARCH uses Amazon Simple Storage Service (Amazon S3) to store its collected data, Amazon Relational Database Service (Amazon RDS) for its shared database, Amazon Elastic Compute Cloud (Amazon EC2) for computing power, and Amazon Route 53 as its domain name system.&lt;/p&gt;

&lt;p&gt;The success story published by AWS can be found in the following blog:&lt;br&gt;
&lt;a href="https://aws.amazon.com/es/blogs/publicsector/assessing-oceans-health-monitoring-shark-populations/" rel="noopener noreferrer"&gt;https://aws.amazon.com/es/blogs/publicsector/assessing-oceans-health-monitoring-shark-populations/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Securing the future of the Tasmanian devil
&lt;/h2&gt;

&lt;p&gt;This marsupial is threatened by human-driven changes, devastating fires, and an infectious cancer that causes facial tumors and has reduced its numbers by more than 80%. The cloud has accelerated the work of University of Sydney experts using Tasmanian devil genome data. Those analyses will be available to researchers worldwide and will help protect these marsupials and other endangered species.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5ch13e8yeeawznciwwx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5ch13e8yeeawznciwwx.jpg" alt="Demons of Tazmania"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The team's work has accelerated since a proof of concept began on AWS; according to the team, it sped up their research while letting them manage their budget carefully.&lt;/p&gt;

&lt;p&gt;If you want to find out more, these are the links:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.zdnet.com/article/university-of-sydney-using-cloud-to-prevent-the-tasmanian-devil-from-extinction/" rel="noopener noreferrer"&gt;https://www.zdnet.com/article/university-of-sydney-using-cloud-to-prevent-the-tasmanian-devil-from-extinction/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://aws.amazon.com/es/opendata/open-data-sponsorship-program/" rel="noopener noreferrer"&gt;https://aws.amazon.com/es/opendata/open-data-sponsorship-program/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Other interesting cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Saildrone
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.saildrone.com/" rel="noopener noreferrer"&gt;Saildrone Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/es/solutions/case-studies/saildrone-video-case-study/" rel="noopener noreferrer"&gt;https://aws.amazon.com/es/solutions/case-studies/saildrone-video-case-study/&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  CMIP6 dataset to foster climate innovation and study the impact of future climate conditions
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/es/blogs/publicsector/now-available-cmip6-dataset-foster-climate-innovation-study-impact-future-climate-conditions/" rel="noopener noreferrer"&gt;https://aws.amazon.com/es/blogs/publicsector/now-available-cmip6-dataset-foster-climate-innovation-study-impact-future-climate-conditions/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>cloud</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Best practices for AWS Athena</title>
      <dc:creator>Christopher Thompson H.</dc:creator>
      <pubDate>Mon, 04 Oct 2021 05:44:01 +0000</pubDate>
      <link>https://dev.to/aws-builders/best-practices-for-aws-athena-12gb</link>
      <guid>https://dev.to/aws-builders/best-practices-for-aws-athena-12gb</guid>
      <description>&lt;p&gt;In this blog I will mention some of the best practices recommended by AWS for building queries in Athena based on my experience and the following resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;a href="https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins" rel="noopener noreferrer"&gt;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  General Recommendations
&lt;/h1&gt;




&lt;h2&gt;
  
  
  Always use WHERE on the partition field
&lt;/h2&gt;

&lt;p&gt;This is mainly to reduce query runtime and cost.&lt;br&gt;
For example:&lt;br&gt;
Avoid:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col_1&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'201912'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and prefer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;col_particion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'201911'&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;speedup&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;savings&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
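
&lt;p&gt;For context, partition pruning only works if the table was declared with partitions in the first place. A minimal sketch of the kind of DDL behind the example above, assuming hypothetical column names and bucket path:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- col_particion becomes a partition key, so the WHERE clause above
-- scans only the matching S3 prefix instead of the whole table
CREATE EXTERNAL TABLE table1 (
  col_1 string,
  col_2 bigint
)
PARTITIONED BY (col_particion string)
STORED AS PARQUET
LOCATION 's3://my-bucket/table1/';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;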






&lt;h2&gt;
  
  
  Avoid using ORDER BY without LIMIT
&lt;/h2&gt;

&lt;p&gt;It is important to understand that ORDER BY must be executed on a single node, which makes it a slow and memory-intensive operation. Ideally it should be avoided; however, if your use case requires it, I recommend always adding a LIMIT.&lt;br&gt;
For example:&lt;br&gt;
Avoid:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and prefer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;speedup&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="mi"&gt;98&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avoid&lt;/span&gt; &lt;span class="s1"&gt;'Query exhausted resources at this scale factor'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Select only the columns needed for the final result
&lt;/h2&gt;

&lt;p&gt;This recommendation is very simple: in practice, avoid SELECT * FROM.&lt;br&gt;
For example:&lt;br&gt;
Avoid:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;tmp_table&lt;/span&gt;
&lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt;
&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col_1&lt;/span&gt;
&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col_2&lt;/span&gt;
&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col_3&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table_1&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
&lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col_1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;tmp_table&lt;/span&gt;
&lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt;
&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col_1&lt;/span&gt;
&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col_2&lt;/span&gt;
&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col_3&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;col_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;col_2&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table_1&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
&lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;col_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;col_3&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col_1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Schedule data aggregation for small files
&lt;/h2&gt;

&lt;p&gt;The numbers speak for themselves:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Number of files&lt;/th&gt;
&lt;th&gt;Run time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SELECT COUNT(*) FROM lineitem&lt;/td&gt;
&lt;td&gt;5000&lt;/td&gt;
&lt;td&gt;8.4 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SELECT COUNT(*) FROM lineitem&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2.31 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speedup&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;72% faster&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
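
&lt;p&gt;One way to schedule that aggregation is an Athena CTAS statement that rewrites the small files into fewer, larger ones. A minimal sketch, assuming the TPC-H lineitem table from the benchmark above and a hypothetical output location:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- bucket_count = 1 forces the rewritten data into a single file
CREATE TABLE lineitem_compacted
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/lineitem_compacted/',
  bucketed_by = ARRAY['l_orderkey'],
  bucket_count = 1
) AS
SELECT * FROM lineitem;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;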




&lt;h2&gt;
  
  
  Prefer the use of regular expressions over 'LIKE'
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Run time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SELECT COUNT(*) FROM lineitem WHERE text_column LIKE '%wake%' OR text_column LIKE '%some%' OR text_column LIKE '%express%' OR text_column LIKE '%hello%'&lt;/td&gt;
&lt;td&gt;20.56 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SELECT COUNT(*) FROM lineitem WHERE regexp_like(text_column,'...')&lt;/td&gt;
&lt;td&gt;15.87 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speedup&lt;/td&gt;
&lt;td&gt;17% faster&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note: the full expression would be &lt;code&gt;regexp_like(text_column, 'wake|some|express|hello')&lt;/code&gt;&lt;/p&gt;
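
&lt;p&gt;Putting the note together with the benchmark, the rewritten query would look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One regular expression per row instead of four LIKE scans
SELECT COUNT(*) FROM lineitem
WHERE regexp_like(text_column, 'wake|some|express|hello');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;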




&lt;h2&gt;
  
  
  When using GROUP BY with multiple fields, order them from highest to lowest cardinality
&lt;/h2&gt;

&lt;p&gt;This will avoid memory errors and reduce the time to deliver results.&lt;br&gt;
For instance:&lt;br&gt;
Avoid:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;people&lt;/span&gt; &lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;column_genre&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;people&lt;/span&gt; &lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;column_genre&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;If you use crawlers to automatically infer the structure of data stored in S3, respect the data types supported by the source engine. Likewise, do not forget to re-run the crawler after any data update that may change the structure, so that the schema in the Glue catalog stays up to date.&lt;/p&gt;

&lt;p&gt;Use &lt;em&gt;MSCK REPAIR TABLE&lt;/em&gt; only if the folders follow the Hive structure 'field1=value1/field2=value2/.../fieldN=valueN', and ideally only right after creating the table: MSCK REPAIR TABLE is an expensive operation, and it is preferable to use ALTER TABLE ADD PARTITION or the Glue API to add partitions.&lt;/p&gt;
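
&lt;p&gt;For reference, adding a partition explicitly is a cheap, targeted operation. A minimal sketch, assuming a hypothetical table partitioned by a string field named field1:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Registers one new partition without scanning every prefix in S3
ALTER TABLE table1 ADD IF NOT EXISTS
PARTITION (field1 = '201912')
LOCATION 's3://my-bucket/table1/field1=201912/';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;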

&lt;p&gt;Do you have any good practices you would recommend? Share them in the comments.&lt;/p&gt;

&lt;p&gt;I hope this post is useful to you. Greetings!&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>Importing metadata from the AWS Glue data catalog into Apache Atlas with EMR</title>
      <dc:creator>Christopher Thompson H.</dc:creator>
      <pubDate>Thu, 26 Aug 2021 06:22:52 +0000</pubDate>
      <link>https://dev.to/aws-builders/importing-metadata-from-the-aws-glue-data-catalog-into-apache-atlas-with-emr-4h8k</link>
      <guid>https://dev.to/aws-builders/importing-metadata-from-the-aws-glue-data-catalog-into-apache-atlas-with-emr-4h8k</guid>
      <description>&lt;h2&gt;
  
  
  What is going to be implemented
&lt;/h2&gt;

&lt;p&gt;We will deploy Apache Atlas on the AWS EMR service, connecting the Hive catalog directly to the Glue service, so that you can dynamically classify your data and see its lineage as it moves through different processes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsxohqzl3jq6omwzeb5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsxohqzl3jq6omwzeb5g.png" alt="architecture of reference"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Presentation of services to use
&lt;/h2&gt;

&lt;p&gt;Amazon EMR is a managed service that simplifies the deployment of big data frameworks like Apache Hadoop and Spark. When you launch a cluster, you can choose from a predefined set of applications or pick your own combination from the list.&lt;/p&gt;

&lt;p&gt;The Apache Atlas project is a set of core governance services that enables companies to effectively and efficiently meet their compliance requirements across the entire enterprise data ecosystem. Apache Atlas provides metadata governance and management capabilities so organizations can catalog their data assets, classify and control those assets, and collaborate on them internally. In other words, it helps teams make the life cycle of their own data transparent. This kind of transparency and governance matters in many business and architectural solutions, because it lets organizations get the most out of what they know about their data, for example through market predictions, customer-safety sessions, impact campaigns, and many other ways of exploiting the behavior of their data.&lt;/p&gt;

&lt;p&gt;Of the many features Apache Atlas offers, the one this article focuses on is data lineage and metadata management for Apache Hive. After a successful Atlas setup, you can use its native tools to import tables from Hive, analyze your data, and present its lineage intuitively to your end users.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.- Create the glue_settings.json configuration file
&lt;/h3&gt;

&lt;p&gt;The first thing we need to do is create a file named glue_settings.json with the following structure on our local computer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Classification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hive-site"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hive.metastore.client.factory.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.- Preparation of the environment in AWS (review of the EMR default roles and network infrastructure)
&lt;/h3&gt;

&lt;p&gt;This step is important when we launch an EMR cluster for the first time through the AWS CLI, especially for the command we will execute in the following steps. The reason is that an EMR cluster needs service roles assigned to it, and the default roles (EMR_DefaultRole and EMR_EC2_DefaultRole) do not exist in a fresh account until something creates them.&lt;/p&gt;

&lt;p&gt;This is easily solved by launching a test cluster through the AWS Management Console: the first time you launch a cluster from the console, it automatically creates the default roles for you, and our real cluster can then reuse them.&lt;/p&gt;

&lt;p&gt;Then you can go to the IAM console and check whether the default roles already exist.&lt;/p&gt;
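
&lt;p&gt;Alternatively, the AWS CLI can create these roles directly, without launching a throwaway cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Creates EMR_DefaultRole and EMR_EC2_DefaultRole if they do not exist yet
aws emr create-default-roles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;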

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4g7cezu5xdnl4b1rlim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4g7cezu5xdnl4b1rlim.png" alt="IAM Roles"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced tip:&lt;/strong&gt; If you want to run Apache Atlas in a restricted, production scenario, you should create a new least-privilege role for EMR and use it in the following steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't forget to delete the test cluster that you used to create the role.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.- Prepare parameters to create and run EMR cluster
&lt;/h3&gt;

&lt;p&gt;This step is important for the execution of the following code. The parameters to define are the following:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster_Name:&lt;/strong&gt; The name you want for the cluster&lt;br&gt;
&lt;strong&gt;Instance_Type:&lt;/strong&gt; The instance type for each node&lt;br&gt;
&lt;strong&gt;Instance_Vol_Size:&lt;/strong&gt; The size (in GiB) of the EBS root volume configured for the cluster&lt;br&gt;
&lt;strong&gt;Key_Name:&lt;/strong&gt; The name of the EC2 key pair used to connect to this cluster&lt;br&gt;
&lt;strong&gt;Subnet_id:&lt;/strong&gt; The ID of the subnet in which to launch the cluster&lt;br&gt;
&lt;strong&gt;S3_EMR_LOGDIR:&lt;/strong&gt; The S3 location for the cluster logs&lt;/p&gt;

&lt;p&gt;In my case, the parameters that I will choose are the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLUSTER_NAME=EMR-Atlas
INSTANCE_TYPE=m4.large
INSTANCE_VOL_SIZE=80
KEY_NAME=key-0a97d3c96668decaf
SUBNET_ID=subnet-09de17cf9eb1c56d3
S3_EMR_LOGDIR=s3://aws-logs-39483989-us-east-1/elasticmapreduce/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To obtain the subnet ID, go to the Amazon VPC console, or retrieve it with the AWS CLI. For more information, see the following link: &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html&lt;/a&gt;&lt;/p&gt;
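
&lt;p&gt;If you prefer the CLI route, a minimal sketch (the --query projection is just one convenient way to format the output):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Lists subnet IDs with their availability zone and CIDR block
aws ec2 describe-subnets \
  --query 'Subnets[].{ID:SubnetId,AZ:AvailabilityZone,CIDR:CidrBlock}' \
  --output table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;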

&lt;h3&gt;
  
  
  4.- Create EMR Cluster with AWS CLI
&lt;/h3&gt;

&lt;p&gt;With everything configured, we create the EMR cluster through the AWS CLI. Note that the same steps could be carried out through the AWS Management Console by mapping each part of the command to the corresponding configuration option in the interface; I simply find the CLI easier.&lt;/p&gt;

&lt;p&gt;The command with all our previously defined configurations would be the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws emr create-cluster &lt;span class="nt"&gt;--applications&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Hive &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;HBase &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Hue &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Hadoop &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ZooKeeper &lt;span class="nt"&gt;--release-label&lt;/span&gt; emr-5.33.0 &lt;span class="nt"&gt;--instance-groups&lt;/span&gt;  &lt;span class="nv"&gt;InstanceGroupType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;MASTER,InstanceCount&lt;span class="o"&gt;=&lt;/span&gt;1,InstanceType&lt;span class="o"&gt;=&lt;/span&gt;m4.large &lt;span class="nv"&gt;InstanceGroupType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;CORE,InstanceCount&lt;span class="o"&gt;=&lt;/span&gt;1,InstanceType&lt;span class="o"&gt;=&lt;/span&gt;m4.large &lt;span class="nt"&gt;--use-default-roles&lt;/span&gt; &lt;span class="nt"&gt;--ebs-root-volume-size&lt;/span&gt; 80 &lt;span class="nt"&gt;--ec2-attributes&lt;/span&gt; &lt;span class="nv"&gt;KeyName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;apache-atlas,SubnetId&lt;span class="o"&gt;=&lt;/span&gt;subnet-0d95c4cdf3119f9ae &lt;span class="nt"&gt;--configurations&lt;/span&gt; file://./glue_settings.json &lt;span class="nt"&gt;--tags&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;EMR-Atlas &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"EMR-Atlas"&lt;/span&gt; &lt;span class="nt"&gt;--steps&lt;/span&gt; &lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;CUSTOM_JAR,Jar&lt;span class="o"&gt;=&lt;/span&gt;command-runner.jar,ActionOnFailure&lt;span class="o"&gt;=&lt;/span&gt;TERMINATE_CLUSTER,Args&lt;span class="o"&gt;=&lt;/span&gt;bash,-c,&lt;span class="s1"&gt;'curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh -o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create an EMR cluster, which you can monitor from the AWS Management Console if you want.&lt;/p&gt;
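
&lt;p&gt;You can also poll the cluster state from the CLI; a minimal sketch, where j-XXXXXXXXXXXXX stands for the cluster ID returned by create-cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Prints STARTING, BOOTSTRAPPING, RUNNING, WAITING, ...
aws emr describe-cluster \
  --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.Status.State' \
  --output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;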

&lt;h3&gt;
  
  
  5.- Modify import-hive.sh script in EMR cluster
&lt;/h3&gt;

&lt;p&gt;Once the cluster is up and running, we must connect to it using any of the available connection methods. In my case, I use an SSH connection. For more information about the steps, see the following link: &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once inside the cluster, execute the following commands in order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo cp&lt;/span&gt; &lt;span class="nt"&gt;-ai&lt;/span&gt; /apache/atlas/bin/import-hive.sh&lt;span class="o"&gt;{&lt;/span&gt;,.org&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;vim /apache/atlas/bin/import-hive.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first command keeps a backup copy (&lt;code&gt;import-hive.sh.org&lt;/code&gt;) before we edit the &lt;code&gt;import-hive.sh&lt;/code&gt; file; you can use any editor you prefer instead of vim.&lt;/p&gt;

&lt;p&gt;When you are inside the &lt;code&gt;import-hive.sh&lt;/code&gt; file, you must make the following changes:&lt;/p&gt;

&lt;p&gt;You will have to change this line of the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ATLASCPPATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HIVE_CP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HADOOP_CP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ATLASCPPATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HIVE_CP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HADOOP_CP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:/usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This adds the Glue Data Catalog client JAR to the script's classpath, so the import reads directly from the Glue catalog.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.- Importing the Glue data catalog to Atlas
&lt;/h3&gt;

&lt;p&gt;Run the modified script to import the Glue metadata into Atlas.&lt;/p&gt;

&lt;p&gt;The user is &lt;code&gt;admin&lt;/code&gt; and the password is &lt;code&gt;admin&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/apache/atlas/bin/import-hive.sh


Enter username &lt;span class="k"&gt;for &lt;/span&gt;atlas :- admin
Enter password &lt;span class="k"&gt;for &lt;/span&gt;atlas :-

2021-08-25T13:58:23,443 INFO [main] org.apache.atlas.hive.bridge.HiveMetaStoreBridge - Successfully imported 5 tables from database aws_db
Hive Meta Data imported successfully!!!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this, you have imported the Glue catalog into Atlas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced tip:&lt;/strong&gt; If you want to keep the catalog up to date automatically, schedule &lt;code&gt;import-hive.sh&lt;/code&gt; to run again after each metadata change.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.- Connection to Atlas
&lt;/h3&gt;

&lt;p&gt;Finally, you must create an SSH tunnel from your machine to the EMR master node in order to reach the Atlas interface locally. To do this, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ssh"&gt;&lt;code&gt;&lt;span class="k"&gt;ssh&lt;/span&gt; -i apache-atlas.pem -vnNT -L &lt;span class="m"&gt;21000&lt;/span&gt;:localhost:21000 hadoop@&lt;span class="err"&gt;{&lt;/span&gt;ip_of_your_cluster&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the tunnel in place, you can reach the interface at the following link:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="http://localhost:21000" rel="noopener noreferrer"&gt;http://localhost:21000&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The login screen will be displayed as shown below; log in with user &lt;code&gt;admin&lt;/code&gt; and password &lt;code&gt;admin&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5771paum4phna5mkfni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5771paum4phna5mkfni.png" alt="Apache Atlas login"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once inside the interface, search for hive_table to find the information from your Glue catalog:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p9ob0p7zwjkslwt1t92.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p9ob0p7zwjkslwt1t92.png" alt="Apache Atlas interfaces"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Reference Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://atlas.apache.org/#/" rel="noopener noreferrer"&gt;https://atlas.apache.org/#/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/jp/blogs/news/metadata-classification-lineage-and-discovery-using-apache-atlas-on-amazon-emr/" rel="noopener noreferrer"&gt;https://aws.amazon.com/jp/blogs/news/metadata-classification-lineage-and-discovery-using-apache-atlas-on-amazon-emr/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>architecture</category>
      <category>cloud</category>
      <category>government</category>
    </item>
    <item>
      <title>What is the value of learning to learn in AWS</title>
      <dc:creator>Christopher Thompson H.</dc:creator>
      <pubDate>Wed, 09 Jun 2021 01:07:23 +0000</pubDate>
      <link>https://dev.to/aws-builders/what-is-the-value-of-learning-to-learn-in-aws-1dbm</link>
      <guid>https://dev.to/aws-builders/what-is-the-value-of-learning-to-learn-in-aws-1dbm</guid>
      <description>&lt;p&gt;Human beings learn to modify their skills and abilities through observation, practice or reasoning. This is called &lt;strong&gt;"learning"&lt;/strong&gt;. In simple words, from a young age we learn to acquire knowledge so as not to make the same mistakes twice, or to simply be better people.&lt;/p&gt;

&lt;p&gt;Bringing this concept into the world of work: at the beginning of our careers, it is very common to leave university (or to still be partway through it) eager to apply what we have learned over many years of effort. However, when you leave the world of studying a profession, you enter a deeper and more intriguing one, and you realize that what you know is only 1% of what you could know, and that in reality everything is more challenging than the way it was taught to you.&lt;/p&gt;

&lt;p&gt;Where you land will partly determine how you develop and how you learn to learn. Yes, I said "learn to learn." Your first job will be essential in shaping how you develop your skills and start your ninja path. However, this is a double-edged sword. Why can learning to learn be a double-edged sword? It is simple. &lt;strong&gt;If you learn the wrong way, it will be much more difficult for you to adjust to the real world when you leave that job&lt;/strong&gt;. I am telling you this because I was about to go down that road. That is why I will show you how to detect it in time and how to take the right path.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step One: Learn to Be Your Own Center&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One of the hardest things when you are inexperienced (a rookie, a young Padawan, a newly licensed professional, an enthusiastic youth, or however you describe yourself on your resume) is to question the why of things. This is because, arriving inexperienced, we need to observe how others solve problems and how they face them, and learn from them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwk3j36t2dtojsmap16xv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwk3j36t2dtojsmap16xv.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Over the years this idea has mutated with the generational changes that have occurred. It shows up in different situations in the working world, for example: a leader or boss tells you to start a task, detailing the problem and the reason, but never tells you how to solve it, saying only "Solve it by tomorrow." Voilà! This is a clear clue that it will not be a guided path, and that you will have to figure out how to solve or fulfill that task on your own.&lt;/p&gt;

&lt;p&gt;If you are proactive, it will be very useful to ask that same leader how you might solve it, or whom you should look for to learn how. But if you are shy or not so proactive, hoping to be guided in your early days, that will be very hard for you. On the contrary, you will most likely feel disappointed that nobody actually told you how to solve the problem, perhaps even keeping "they never taught me how" as the excuse for not having done it yet. It is important to note that what motivates proactivity has been changing across generations, and older generations find it increasingly difficult to understand what makes people or their teams proactive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecfqn6a3ud1hm2o3713z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecfqn6a3ud1hm2o3713z.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Therefore, you must learn to be your own center, looking around at how others work in order to find your own reason to be proactive. If you arrive at a job for the first time, understand that even if it has a "culture" (or an attempted culture imposed by whoever created the company), that does not mean you cannot step out of that circle and be better than it. If your analysis leads you to conclude that the company's culture is passionate and attractive to your way of thinking, you are in the right place. But what if you don't like that culture? What if the way of working at your first job does not seem right to you? Should you accept it and learn that way? Clearly, you should try to improve it. And if you can't, or they won't let you, it means you are not in the right place. Don't worry: improving it does not mean starting a war. On the contrary, it will make your workplace better and even generate good results.&lt;/p&gt;

&lt;p&gt;Some time ago, talking with some people who had left my current job, I realized that they did not leave because of an "opportunity" that appeared out of nowhere or by magic. Rather, they had decided to mark their profiles as open to receiving opportunities. Who hasn't done it? When you already want to leave a job, you signal on professional networks that you are open to offers. But this has always been a choice: you decide when you want to receive offers and when you want to learn more about them. Outdated companies and boomers don't take the time to understand why people do it. They focus only on the superficial, on "he didn't return the favor" or "he should have been more committed to us." They never really ask why he decided to accept the other offer, or why he decided to find out more about it. You agree to be part of a process that can get you out of a circle you no longer want to be in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ytzaewl5ybws9ibts65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ytzaewl5ybws9ibts65.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is why it is important to know that deciding your own path should not scare you. Decide to be your own center, building the confidence to judge whether the business culture imposed on you is the one that really represents you. This is how a real culture is formed.&lt;/p&gt;

&lt;p&gt;And now you will ask yourself: why is this young man telling me all this? Because here comes the next point: what if I like where I am, but I am learning the wrong way? How do I know whether I am learning the right way?&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Decide how I want to learn&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Something I can salvage from my current job is that I learned how I want to learn. I was saved from learning out of obligation because I discovered that I want to be better than I was used to being. The idea of studying the wrong way, taking the easy path, studying for a medal, memorizing test questions, or studying to "meet a business goal" didn't suit me. Discovering this was not easy.&lt;/p&gt;

&lt;p&gt;In the beginning I was part of that system. It felt strange, because I actually could learn that way and generate results. However, whenever I spoke with professionals who had learned the slow way, the difference from my fast-track learning was huge. I could pretend that I knew a lot, and with my medals I even felt able to face any question, since people trusted me. But in reality they knew more than I did, and in a way, it showed. In the end, some external professionals helped me understand which path to choose.&lt;/p&gt;

&lt;p&gt;During that start-up process I had the opportunity to meet very professional people who helped me understand how to decide how to learn. If it hadn't been for them, I would practically still be inside the imposed system, and maybe I would still just be someone with medals. They had a study plan guided the right way and with the right people. In the context I work in, the world of Amazon Web Services, these were the best people who could tell me how I should learn AWS: the &lt;a href="https://aws.amazon.com/es/professional-services/" rel="noopener noreferrer"&gt;Amazon Professional Services&lt;/a&gt; team. They were professionals prepared to face any challenge in the cloud, the kind who would leave any partner company defeated in a contest of who was cooler. But in reality they were normal people who had understood how they should learn, because Amazon as a company has a very interesting culture around people's learning. They showed me some of their secrets: how they learned, what platform they used, how they related to their peers, their networks, and so on. An almost hidden world for a mere mortal outside Amazon. Seeing this made me realize that learning the fast way was really just a long-term sentence. Since then, I have never wanted to follow that path again, and I took ownership of my own learning.&lt;/p&gt;

&lt;p&gt;What if they hadn't helped me? What if you don't have someone to help you? Well, I am sure that once you understand that you are the center of your own attention, you will still discover how you want to learn. Perhaps not as naturally or spontaneously, but you would reach the same conclusion: everyone must understand which paths exist and choose one. In fact, by reading this post you are getting a boost from someone who wants to help you.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Grow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuf0wowakj94dt4xka4aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuf0wowakj94dt4xka4aw.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you were already able, even partially, to work out the conclusions of steps one and two, I congratulate you. You are already growing. You are already learning. Very few people manage to do it, and very few know how. While there will never be an owner of the truth, and I am clearly not one, there will always be someone who can guide you toward discovering your own truth. Don't let others take ownership of your truth or force you to follow theirs. Reading this post is itself a form of growth.&lt;/p&gt;

&lt;p&gt;And rest assured, taking these steps will not always be easy. I never said it was. You will have to make difficult decisions, or think of optimal strategies to advance to a better place. Sometimes you will feel exhausted, distant, or disappointed. But you should know that in the end, the result is worth it.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Bonus: Where to look for a correct Study Plan&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;While I have shown you part of my vision for avoiding studying the wrong way, I have not yet told you how to find a good study plan. Here I will describe my experience learning Amazon Web Services, but you can apply the same search and planning to any other topic you want to study.&lt;/p&gt;

&lt;p&gt;First of all, I never recommend studying only the questions from some exam. That alternative is the worst anyone could recommend, because there is no real learning and no understanding of the broader context of the services that make up the AWS cloud. AWS internal teams use their own platforms to learn, and this is clearly not a problem for them.&lt;/p&gt;

&lt;h3&gt;
  
  
  My recommendations
&lt;/h3&gt;

&lt;p&gt;This does not mean that you cannot take mock exams, which brings me to my first recommendation: &lt;a href="https://www.whizlabs.com/" rel="noopener noreferrer"&gt;Whizlabs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0269ii7drb98mez4snn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0269ii7drb98mez4snn4.png" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Whizlabs
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.whizlabs.com/" rel="noopener noreferrer"&gt;Whizlabs&lt;/a&gt; is a platform where its greatest power is based on simulation exams of various subjects. They have exams for almost all cloud providers where each question has its justification and its link to official documentation. For me this site was a great help to start simulating a real exam. It also has courses and labs to test AWS services.&lt;/p&gt;

&lt;p&gt;Another recommendation that I can give you is &lt;a href="https://www.qwiklabs.com/" rel="noopener noreferrer"&gt;QwikLabs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  QwikLabs
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2mhpea88khohgmj1rvo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2mhpea88khohgmj1rvo.png" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a site where you can test your skills in administration consoles prepared for each specific lab. It is quite useful, since a global subscription gives you access to all the labs on the platform. Be very careful about doing anything that is not in the lab instructions, because they can cancel your account.&lt;/p&gt;

&lt;h3&gt;
  
  
  TutorialsDojo
&lt;/h3&gt;

&lt;p&gt;Finally, another site that has helped me a lot to learn the right way is &lt;a href="https://tutorialsdojo.com/" rel="noopener noreferrer"&gt;TutorialsDojo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm7l22et9m79t5l8c2f4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm7l22et9m79t5l8c2f4.png" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This site is a great place to start taking quizzes and practice exams covering most of the existing cloud provider certifications.&lt;/p&gt;

&lt;p&gt;Other platforms that I recommend are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.udemy.com/" rel="noopener noreferrer"&gt;Udemy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloudacademy.com/" rel="noopener noreferrer"&gt;CloudAcademy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Courses I Recommend
&lt;/h3&gt;

&lt;p&gt;All over the internet there are a variety of courses that will give you an accurate picture of AWS. However, one of the biggest secrets behind my knowledge is &lt;a href="https://learn.cantrill.io/" rel="noopener noreferrer"&gt;Adrian's courses&lt;/a&gt;, which have had a very positive impact on me. They are built with an extreme dedication to improving the content he creates every day.&lt;/p&gt;

&lt;p&gt;Each course and diagram is Adrian's own creation, with the sole objective of explaining cloud concepts in the best way possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faly6mu8h90nv0s5tu9dz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faly6mu8h90nv0s5tu9dz.png" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The truth is that I have taken courses from other providers such as Udemy, Cloud Academy, aws.training, and so on, but I have never seen a course as dedicated and practical as this one. It also includes many hands-on examples that will help you understand cloud concepts.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AWS Appflow as a Salesforce Migration Method with CDC</title>
      <dc:creator>Christopher Thompson H.</dc:creator>
      <pubDate>Sat, 29 May 2021 06:38:40 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-appflow-as-a-salesforce-migration-method-with-cdc-part-1-4742</link>
      <guid>https://dev.to/aws-builders/aws-appflow-as-a-salesforce-migration-method-with-cdc-part-1-4742</guid>
      <description>&lt;p&gt;&lt;em&gt;My Background: I am Cloud Engineer | Project Manager | Solution Architect in APN Advanced Consulting Partner | MLOps Engineer | 4x AWS | CSFPC™ | AWS Community Builder | Poke Master | Life apprentice&lt;/em&gt; :D&lt;/p&gt;




&lt;p&gt;For a long time now, a business model known as SaaS has been expanding, allowing software to be distributed over the Internet. This approach came to replace or complement the traditional business model, shifting the focus from the product to the service.&lt;/p&gt;

&lt;p&gt;To set the context for this post, we first need to know &lt;strong&gt;Salesforce&lt;/strong&gt;, a service born under the SaaS model a few years ago, and the service we will discuss throughout this post.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyf97m3ffulw73cmfwdt9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyf97m3ffulw73cmfwdt9.png" alt="What is Salesforce?"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.salesforce.com/" rel="noopener noreferrer"&gt;Salesforce&lt;/a&gt; is a famous CRM in the market, which provides 360º management capabilities of sales, marketing, customer service and all points of contact, in one place. Many clients have used or have migrated their business processes to this service that allows to keep all the flow of commercial or valuable information for the company centralized. &lt;/p&gt;

&lt;p&gt;Along with the growth of cloud computing, solutions under the SaaS model increasingly needed to migrate from on-premises servers to the cloud. For this reason, providers have released various services over the years to help carry out this migration. Among them is &lt;strong&gt;Amazon AppFlow&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/appflow/" rel="noopener noreferrer"&gt;Amazon Appflow&lt;/a&gt;  is a fully managed no-code integration service enabling seamless and secure data flow between Amazon Web Services (AWS) and software-as-a-service (SaaS) applications. It allows you to source data from AWS services and SaaS applications such as Salesforce, and aggregate them in AWS data lakes and data warehouses to draw unique data-driven insights. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxv8xz7kkxaouhh7arff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxv8xz7kkxaouhh7arff.png" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main characteristic of this service is its ability to connect to a variety of data sources that operate under the SaaS model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybl984rrs3asshqulzwm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybl984rrs3asshqulzwm.jpg" alt="Compatibility"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;New data sources will most likely be added to AppFlow's list of supported connectors over time.&lt;/p&gt;

&lt;p&gt;However, the point of this post is to show you how this service handles CDC (Change Data Capture), and to share my experience as one of the many pioneers who developed real solutions within weeks of the product's release.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Experience with the Release of AWS Appflow
&lt;/h2&gt;

&lt;p&gt;First of all, to protect the client who used this service, I will refer to them generically so as not to breach any NDA.&lt;/p&gt;

&lt;p&gt;The requested use case was to migrate Salesforce objects to a Data Warehouse, with the goal of reducing storage costs in the CRM. At the beginning of the project, the architecture we had proposed together with AWS looked something like the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmjqxu6e85vcc2sr109s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmjqxu6e85vcc2sr109s.png" alt="image"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;You may wonder why we migrated directly to the Redshift destination rather than to AWS S3 with Redshift Spectrum. The reason is simply that, when we started the project, AWS AppFlow did not yet support writing to AWS S3 with an upsert option, which would have forced us to add extra processing steps and more work time.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Somewhat naively, we assumed, together with the AWS architects, that the service would work according to the customer's expectations. However, the CDC behavior turned out to be the biggest problem of the project.&lt;/p&gt;

&lt;p&gt;The problem was that when a record was updated at the source, it arrived as a new row at the destination (Amazon Redshift), producing duplicates in any report built on top of the data. Although the timestamp is a field that legitimately changes, records that should be unique by fields such as Id ended up duplicated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpj4ouwp4e7y02gva9h9.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpj4ouwp4e7y02gva9h9.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: &lt;em&gt;This implementation took place before the AWS AppFlow update of October 2020, which added an upsert option for data sources such as Salesforce.&lt;/em&gt; &lt;a href="https://docs.aws.amazon.com/appflow/latest/userguide/doc-history.html" rel="noopener noreferrer"&gt;Document history for the AWS AppFlow user guide&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The solution to this problem was to implement a change data capture process as a stored procedure in Redshift. However, it could also have been solved by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performing the UPSERT processing with a Glue Job&lt;/li&gt;
&lt;li&gt;Sending the data to AWS S3 instead of the data warehouse and performing the UPSERT processing with services such as Amazon EMR, managing the Data Lake layers (raw, stage, analytics).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We chose a stored procedure because the nature of the project was to migrate to a Data Warehouse without passing through a Data Lake, a structure the client did not yet require. In addition, according to the metrics we analyzed, our Redshift cluster had enough processing power to run this stored procedure from a trigger executed at intervals of a few minutes.&lt;/p&gt;

&lt;p&gt;We also had to change the automation: instead of the AppFlow jobs being triggered automatically by the service's incremental load, we orchestrated the executions ourselves. The architecture now looked as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2g789u3gsf7q0qrz8ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2g789u3gsf7q0qrz8ug.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps in architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1.-&lt;/strong&gt; The first thing was to find a way to automate the execution of full-load flows, simulating a daily load. The AWS team proposed that we work from a CloudFormation example of AppFlow with time-based automation: the &lt;a href="https://github.com/aws-samples/amazon-appflow/tree/master/appflow-time-automation" rel="noopener noreferrer"&gt;Amazon AppFlow Relative Time Frame Automation Example&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.-&lt;/strong&gt; The second step starts a Lambda that searches an AWS DynamoDB control table for all the templates stored in S3 and the configuration of the objects, with the aim of creating the jobs in AWS AppFlow. This whole flow is managed by an AWS Step Functions state machine, giving greater control in case of failures.&lt;/p&gt;
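
&lt;p&gt;As a rough sketch of that Lambda, and stressing that the control table name, its attributes, and the status convention are all assumptions for illustration rather than the real project's code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

# Hypothetical control table; names are illustrative.
dynamodb = boto3.resource("dynamodb")
control_table = dynamodb.Table("appflow_control")

def lambda_handler(event, context):
    # Each item points at a CloudFormation template in S3 plus the
    # Salesforce object settings used to create the AppFlow job.
    items = control_table.scan()["Items"]
    pending = [i for i in items if i.get("status") == "PENDING"]
    return {"objects_to_create": [i["object_name"] for i in pending]}
&lt;/code&gt;&lt;/pre&gt;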

&lt;p&gt;The Step Function has the following design:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lve17wx9v6p9qlm08fx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lve17wx9v6p9qlm08fx.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.-&lt;/strong&gt; The lambda_trigger_appflow function is responsible for launching a CloudFormation stack that creates an AWS AppFlow job. This Lambda also updates the DynamoDB control table, which is later queried by another Lambda called Status_Job, whose purpose is to validate that the jobs were created by the CloudFormation stack.&lt;/p&gt;
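
&lt;p&gt;Again, a sketch under assumptions: the stack naming convention, template location, and parameters here are illustrative only:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

cloudformation = boto3.client("cloudformation")
control_table = boto3.resource("dynamodb").Table("appflow_control")

def lambda_handler(event, context):
    object_name = event["object_name"]
    # Launch the stack that creates the AppFlow job for this object.
    cloudformation.create_stack(
        StackName=f"appflow-{object_name}",
        TemplateURL=event["template_url"],  # template stored in S3
        Parameters=[{"ParameterKey": "ObjectName",
                     "ParameterValue": object_name}],
    )
    # Record the state so Status_Job can validate the creation later.
    # "status" is a DynamoDB reserved word, hence the expression alias.
    control_table.update_item(
        Key={"object_name": object_name},
        UpdateExpression="SET #s = :s",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":s": "CREATING"},
    )
&lt;/code&gt;&lt;/pre&gt;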

&lt;p&gt;&lt;strong&gt;4.-&lt;/strong&gt; The stack creates an AWS AppFlow job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5.-&lt;/strong&gt; The AppFlow job is created with a "Ready to run" status.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6.-&lt;/strong&gt; The start_job_appflow Lambda is in charge of starting all the jobs configured for this execution, first validating that they were all created correctly.&lt;/p&gt;
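
&lt;p&gt;As before, treat this as a sketch with illustrative names rather than the project's actual code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

appflow = boto3.client("appflow")

def lambda_handler(event, context):
    started = []
    for flow_name in event["flow_names"]:
        # describe_flow raises an error if the stack failed to create the flow.
        appflow.describe_flow(flowName=flow_name)
        run = appflow.start_flow(flowName=flow_name)
        started.append({"flow": flow_name,
                        "executionId": run.get("executionId")})
    return {"started": started}
&lt;/code&gt;&lt;/pre&gt;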

&lt;p&gt;&lt;strong&gt;7.-&lt;/strong&gt; After the AWS AppFlow jobs have finished, an AWS Glue job is executed. Through boto3 it retrieves the Redshift credentials from AWS Secrets Manager and then, via Python code, executes a procedure located in Redshift called &lt;code&gt;consolidated_of_tables&lt;/code&gt;. The statement executed is &lt;code&gt;CALL consolidated_of_tables()&lt;/code&gt;; no parameters are passed to this procedure.&lt;/p&gt;
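
&lt;p&gt;A minimal sketch of that Glue step, assuming the secret stores the connection fields shown and that a driver such as psycopg2 is packaged with the job (the secret name is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

import boto3
import psycopg2  # assumed to be packaged with the Glue job

# Fetch the Redshift credentials from AWS Secrets Manager.
secret = boto3.client("secretsmanager").get_secret_value(
    SecretId="redshift/salesforce_dwh"  # illustrative secret name
)
creds = json.loads(secret["SecretString"])

# Connect to Redshift and call the consolidation procedure.
conn = psycopg2.connect(
    host=creds["host"], port=creds.get("port", 5439),
    dbname=creds["dbname"], user=creds["username"],
    password=creds["password"],
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CALL consolidated_of_tables();")
conn.close()
&lt;/code&gt;&lt;/pre&gt;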

&lt;p&gt;&lt;strong&gt;8.-&lt;/strong&gt; This procedure consolidates what is found in a staging schema called &lt;code&gt;salesforce_incremental&lt;/code&gt;: using constructs such as GROUP BY and the ROW_NUMBER window function, it compares the incremental tables against the target tables in the final schema, called &lt;code&gt;salesforce&lt;/code&gt;. In short, the procedure compares each incremental table with its target table and prevents the duplication of records that should be unique.&lt;/p&gt;
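
&lt;p&gt;The real procedure was never published, but a minimal Redshift sketch of the idea, for a single illustrative table and column list, would deduplicate by Id (keeping the most recent change) and then merge:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Illustrative sketch of the consolidation logic for one object.
CREATE OR REPLACE PROCEDURE consolidated_of_tables()
AS $$
BEGIN
    DROP TABLE IF EXISTS latest_changes;

    -- Keep only the latest version of each Id from the incremental load.
    CREATE TEMP TABLE latest_changes AS
    SELECT *
    FROM (SELECT i.*,
                 ROW_NUMBER() OVER (PARTITION BY id
                                    ORDER BY systemmodstamp DESC) AS rn
          FROM salesforce_incremental.account i) ranked
    WHERE rn = 1;

    -- Replace any existing versions of those Ids in the target table.
    DELETE FROM salesforce.account
    USING latest_changes
    WHERE salesforce.account.id = latest_changes.id;

    INSERT INTO salesforce.account
    SELECT id, name, systemmodstamp  -- illustrative column list
    FROM latest_changes;
END;
$$ LANGUAGE plpgsql;
&lt;/code&gt;&lt;/pre&gt;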

&lt;p&gt;&lt;strong&gt;9.-&lt;/strong&gt; Finally, once the data is consolidated in the final schema, a Lambda called &lt;code&gt;delete_elements_flows_snapshots&lt;/code&gt; runs as the last step. It deletes the CloudFormation stacks and the AppFlow jobs already created, so as not to exceed the service quotas.&lt;/p&gt;
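
&lt;p&gt;A sketch of that cleanup, with the same illustrative naming convention as above; since each flow was created by its stack, deleting the stack also removes the flow:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

cloudformation = boto3.client("cloudformation")

def lambda_handler(event, context):
    for object_name in event["object_names"]:
        # Deleting the stack removes the AppFlow job it created,
        # keeping the account under the AppFlow flow quota.
        cloudformation.delete_stack(StackName=f"appflow-{object_name}")
&lt;/code&gt;&lt;/pre&gt;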

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Although this solution may have some shortcomings given the capabilities AWS AppFlow had at the time, it is a simple architecture to implement, and it can be used in any proof of concept for clients that want to move their Salesforce objects to a DWH such as Amazon Redshift.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>appflow</category>
      <category>cloud</category>
      <category>migration</category>
    </item>
  </channel>
</rss>
