17. Mainframes vs Cloud Computing

In this article I am going to highlight the key challenges I faced while working in mainframe/DB2 support and how cloud technologies can address them. These are things I learnt while building my CloudResume.

Business as usual.

While I was working as a contractor for a US-based health insurance company**, I was often bombarded with calls over the weekend asking for assistance, as I was the only offshore POC for a set of business-critical batch jobs.

** All object names are arbitrary.


Architecture

  • Transactional data from various third-party sources/service providers came into the staging area, which updated the MEM* table on the base mainframe using CDC during business hours.

  • The table data was then loaded into the MEM* table on the clone mainframe by a 4-hour nightly cycle (12 am - 4 am CST).

  • The consolidated previous day's data was consumed by the business from 7 am CST onwards through the frontend, for processing members' health claims.

  • These 70-odd batch jobs were business critical, as a failure in any one of them would stop the entire batch. We had to fix the abended jobs and restart the cycle. However, there was an SLA deadline of 4 am, beyond which the frontend would access 1-day-old data (from the mainframe) and the claims wouldn't get processed, leading to escalations.

*MEM table - used to consolidate health care plan information of members.

1. Space Issue

The primary challenge we faced was the DASD space issue - the SB37 abend. The jobs would frequently abend due to a lack of storage.
The batch jobs were a set of sequential unload jobs which unloaded records of the MEM table from the base mainframe into 70 datasets. These datasets were later loaded into the clone mainframe by separate load jobs.

Example - UNLOAD JOB 1:JCL.


space parameter

                         //SPACE=(CYL,(20,10),RLSE) --> Defines the space parameters with primary quantity = 20 cylinders
                         and secondary quantity = 10. Maximum = (20 + (10*15)) = 170 cylinders can be allocated.
                         The JCL also references a control card dataset used to pass the LIMIT key.

Formula - to calculate space parameters

                     Number of cylinders needed = (number of records * LRECL of dataset) / 849960
                     Records: 340000
                     LRECL: 214
                     Bytes per cylinder (standard): 849960
                     Number of cylinders: approximately 86 cylinders needed for 340000 records
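
For reference, here is the same back-of-the-envelope arithmetic as a small Python helper (a sketch only; the 849960 bytes-per-cylinder figure is the standard value quoted above):

    import math

    BYTES_PER_CYL = 849960  # standard bytes per cylinder, as used in the formula above

    def cylinders_needed(records: int, lrecl: int) -> int:
        """Estimate how many cylinders an unload dataset will need."""
        return math.ceil(records * lrecl / BYTES_PER_CYL)

    print(cylinders_needed(340000, 214))  # -> 86 cylinders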

The problem was that the business could not confirm the approximate day-to-day spike in the number of records, so finding a correct space parameter for all the jobs was becoming difficult. And if we set the primary quantity too large in the JCL, MVS was not able to allocate the extents, because other application jobs also required space in the same LPAR.
I had to coordinate with third-party storage teams to set the correct space parameters and rerun the jobs once they had allocated an additional DASD volume.
By the time we fixed the issue, the 4 am SLA had been breached and I had to explain the delay to management and the onshore team.

The storage team was reluctant to increase the DASD volume allocation permanently for our jobs, since storage comes at a premium, and they wanted justification; our business team, however, was unable to confirm future spikes in records. So triage calls went on forever without any permanent fix.

Why Cloud?

                 S3 --> Lambda --> AWS Batch --> RDS

The source files can be stored in S3. Once they are pushed to S3, we can configure an S3 event notification (directly or via SNS) to invoke a separate Lambda function. However, a Lambda function can only run for a maximum of 15 minutes, so we need AWS Batch for long-running SQL workloads. The Lambda function triggers AWS Batch to do the work.

As AWS Batch provisions and manages the compute resources (EC2 + EBS storage), it scales accordingly, so we don't have any resource constraints or space issues. Once the batch processing is done, all allocated infrastructure is removed, thus saving cost.

Once the batch processing is over, the data can be pushed to Amazon RDS for storage, which can run as a Multi-AZ deployment with synchronous replication to a standby instance.
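
As a rough sketch of the glue between S3 and AWS Batch (the job queue and job definition names below are assumptions, not real resources), the Lambda handler only needs to submit a Batch job for each incoming file:

    import boto3

    batch = boto3.client("batch")

    def handler(event, context):
        """Invoked by the S3 event notification; hands the file off to AWS Batch."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            # Submit the long-running load/unload work to AWS Batch.
            batch.submit_job(
                jobName=f"mem-load-{context.aws_request_id}",
                jobQueue="mem-unload-queue",        # assumed job queue
                jobDefinition="mem-unload-job",     # assumed job definition
                containerOverrides={
                    "environment": [
                        {"name": "SRC_BUCKET", "value": bucket},
                        {"name": "SRC_KEY", "value": key},
                    ]
                },
            )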

2. LIMIT key partitioning

As the MEM table containing the consolidated member data of our application kept growing with new records, we had to manually repartition the UNLOAD jobs using the MEM_ID field as the LIMIT key. This helped avoid some jobs consuming more resources than others.

Example - UNLOAD JOB 1:CONTROL CARD DATASET
LIMIT key set as MEM_ID.


  • Before - Job 1 was doing all the unloading and consuming more resources.


  • After - Jobs are evenly distributed to unload records based on the new LIMIT keys.


Why Cloud?

Amazon DynamoDB stores data in partitions. Partition management is handled entirely by DynamoDB, so you never have to manage partitions yourself.
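
As a rough illustration (assuming a table named MYTABLE with MEM_ID as its partition key), DynamoDB hashes the partition key of every item to decide which physical partition it lands on, so there is no manual LIMIT-key balancing:

    import boto3

    table = boto3.resource("dynamodb").Table("MYTABLE")  # assumed table name

    # DynamoDB hashes MEM_ID internally to spread these items across partitions;
    # no manual LIMIT keys or repartitioning of jobs are needed.
    with table.batch_writer() as batch:
        for mem_id in ("A0001", "A0002", "Z9999"):
            batch.put_item(Item={"MEM_ID": mem_id, "MEM_NAME": "SAMPLE MEMBER"})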

3. CPU - soft capping

There were several instances when the batch jobs ran slower and had higher elapsed times. On further investigation, we found that the mainframe performance & capacity team had enabled soft capping for our application jobs without informing us. This was done to provision more CPU time to other application jobs running in the same LPAR, and sometimes to avoid heating issues.

This did not cause any job abends; however, it delayed the entire process due to slow-running jobs, causing SLA breaches due to the lack of compute resources.

Why Cloud?

With cloud services we can auto-scale our infrastructure and automate capacity management, thus avoiding situations like these.
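
For example, a target-tracking policy on an EC2 Auto Scaling group keeps average CPU near a target instead of silently capping the workload. A minimal sketch, assuming an existing Auto Scaling group named batch-workers:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Scale the group out/in automatically to hold average CPU around 70%.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="batch-workers",   # assumed Auto Scaling group
        PolicyName="cpu-target-tracking",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 70.0,
        },
    )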

4. Fragmented regions

The 70-odd batch jobs were taking significantly longer unload times in the pre-production regions. This was because the senior DBAs forgot to perform a REORG on the base mainframe tables in those regions after production data was migrated. There were also discrepancies in the index definitions for tables in those regions compared to the production tables.

Why Cloud?

This scenario can be avoided in the cloud with the help of uniform CloudFormation templates deploying the same infrastructure across environments and regions (IaC).
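
A minimal sketch of the idea with Boto3 (the template file infra.yaml and the stack name are hypothetical): the same template is deployed to every region, so each environment ends up with identical table and index definitions:

    import boto3

    # Single source-of-truth template (hypothetical file name).
    with open("infra.yaml") as f:
        template_body = f.read()

    # Deploy the same stack definition to every target region/environment.
    for region in ("us-east-1", "us-west-2"):
        cfn = boto3.client("cloudformation", region_name=region)
        cfn.create_stack(
            StackName="member-data-stack",          # assumed stack name
            TemplateBody=template_body,
            Capabilities=["CAPABILITY_NAMED_IAM"],
        )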

5. Access restrictions

While working in the offshore account, there were several occasions when the business wanted me to query unrestricted production tables or deploy JCL changes in production to fix certain issues. However, my TSO ID did not have access to the tools (SPUFI, Changeman) needed to do that; this restriction applied across the entire offshore account. So I had to wait for my onshore counterpart to perform those changes, causing delays.

Why Cloud?

Granular IAM policies granting READ/WRITE access to a specific resource ARN can be attached to a specific user or group.
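
A rough sketch with Boto3 (the group name, policy name, and table ARN are hypothetical) that grants a support group read-only access to a single DynamoDB table:

    import json
    import boto3

    iam = boto3.client("iam")

    read_only_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:Scan"],
                "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/MYTABLE",
            }
        ],
    }

    # Attach the inline policy to the offshore support group only.
    iam.put_group_policy(
        GroupName="offshore-support",        # assumed group name
        PolicyName="mytable-read-only",
        PolicyDocument=json.dumps(read_only_policy),
    )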

6. Scheduling

The mainframe batch jobs across regions were scheduled using the ASG-ZEKE tool. Schedule changes across regions were handled by a separate contractor, and we had to triage with them for even minor changes.

Why Cloud?

I found CI/CD pipelines and job scheduling easy to set up in GitHub Actions, and the way it implements workflows for interdependent jobs is easy to track.

7. Maintenance

Although mainframes are extremely robust machines, they do require maintenance. If any jobs are running during those windows, they are cancelled (S222 abend) by the data center because the LPAR is taken down. So we had to analyse the list of highly impacted jobs in advance and put them on hold. This was a tedious process, as the maintenance windows varied. We were also unable to query the tables needed for resolving incidents or fixing defects during those hours.

Why Cloud?

With infrastructure distributed across several regions, there is better hardware redundancy in case of failures or outages.

8. Database

  • While working as a junior DBA on one of my earlier projects, I saw the friction between the developers and the DBAs. The DBAs usually have DDL/DML scripts which need to be managed on a regular basis after triaging with the developers.
  • In an RDBMS like DB2, an admin usually needs to configure the database, define the schema, tablespace, and field data types and lengths while creating a table, and then grant access to the application developers.

Sample DB2 SQL

                  create table MYSCHEMA.MYTABLE (
                  MEM_ID CHAR(5) NOT NULL,
                  MEM_NAME CHAR(50) NOT NULL,
                  EMAIL CHAR(8) NOT NULL,
                  UPD_DTE DATE NOT NULL,
                  PRIMARY KEY (MEM_ID)
                  ) in DB1234.TS1234;

                  grant INSERT,UPDATE,SELECT,DELETE ON MYSCHEMA.MYTABLE TO GROUP MYGROUP;
                  grant INSERT,UPDATE,SELECT,DELETE ON MYSCHEMA.MYTABLE TO USER MYUSER;
  • Select queries must be written and tuned for performance, which requires additional knowledge of the DB2 optimizer.

Why Cloud?

A NoSQL database like DynamoDB works primarily on key-value pairs, requires only a primary key, and doesn't require a schema to create a table. It removes the administrative burden of operating and scaling a distributed database. The table can be created using Boto3 in Python or a SAM template, so there is no need for separate DBAs to perform these tasks. However, DynamoDB is not an RDBMS and does NOT support SQL queries the way an RDBMS does.
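
A minimal Boto3 sketch of that: the table is created with only the MEM_ID key declared, and every other attribute stays schema-less (the on-demand billing mode is my assumption):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Only the key attribute is declared; MEM_NAME, EMAIL, etc. need no schema.
    dynamodb.create_table(
        TableName="MYTABLE",
        AttributeDefinitions=[{"AttributeName": "MEM_ID", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "MEM_ID", "KeyType": "HASH"}],
        BillingMode="PAY_PER_REQUEST",   # assumed: on-demand, no capacity planning
    )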

Sample table - MYTABLE

                          SELECT * FROM MYTABLE
                          WHERE EMAIL ='username@email.com'; 

To perform a similar operation in DynamoDB, we could scan the entire table, which is costly and not recommended. The better way is to define a hash key on the MEM_ID attribute and create a GSI (Global Secondary Index) on the EMAIL attribute in the SAM template.


Then we can search the desired rows with Boto3 in Python as per the application requirement.

Using: query

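A minimal sketch of that query, assuming the GSI on the EMAIL attribute is named email-index:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("MYTABLE")

    # Query the GSI instead of scanning the whole table.
    response = table.query(
        IndexName="email-index",                                # assumed GSI name
        KeyConditionExpression=Key("EMAIL").eq("username@email.com"),
    )
    print(response["Items"])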

I got these concepts from here and here.
NoSQL needs a different approach compared to RDBMS. There is also traditional RDBMS support in the cloud using RDS.

9. User Interface

The ISPF interface can be challenging to navigate for repetitive tasks. However, VS Code editor support was recently added for z/OS mainframes. z/OS also only allows a single session per TSO user ID, so we cannot open multiple sessions across different terminals/devices.


Why Cloud?

I can use the editor of my choice for development. I can pass my IAM credentials as environment variables while deploying code from a virtual machine, which blew my mind.
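
A small sketch of what that looks like: Boto3 picks up the standard AWS credential environment variables automatically, so nothing is hard-coded in the deployment script (the values below are placeholders):

    import os
    import boto3

    # Standard AWS credential environment variables (placeholder values);
    # in practice they are injected by the CI/CD runner or the VM's profile.
    os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
    os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

    # Boto3 reads the variables above automatically; no credentials in code.
    s3 = boto3.client("s3")
    print([b["Name"] for b in s3.list_buckets()["Buckets"]])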

Those were some of my observations. Thank you for reading.

