In this article I highlight the key challenges I faced while working in mainframe/DB2 support and how cloud technologies can address them. These are things I learnt while building my Cloud Resume.
Business as usual.
While I was working as a contractor for a US-based health insurance company**, I was often bombarded with calls over the weekend asking for assistance, as I was the only offshore POC for a set of business-critical batch jobs.
** All object names are arbitrary.
Transactional data from various third-party sources/service providers came into the staging area, which updated the MEM* table in the base mainframe using CDC during business hours. The table data was then loaded into the MEM* table in the clone mainframe by a 4-hour nightly cycle (12am - 4am CST). The consolidated previous day's data was consumed by business from 7am CST onwards through the frontend, for processing the health claims of members.
These 70-odd batch jobs were business critical, as a failure in any one of them would stop the entire batch. We had to fix the abended jobs and restart the cycle. However, there was an SLA deadline of 4am, beyond which the frontend would access day-old data (from the mainframe) and the claims wouldn't get processed, leading to escalations.
*MEM table - used to consolidate health care plan information of members.
1. Space Issue
The primary challenge we usually faced was a DASD space issue - the SB37 abend. These jobs would frequently abend due to lack of storage. The batch jobs were a set of sequential unload jobs which unloaded records of the MEM table from the base mainframe into 70 datasets. These datasets were later loaded into the clone mainframe by separate load jobs.
Example - UNLOAD JOB 1 JCL:

//SPACE=(CYL,(20,10),RLSE)

This defines the space parameters with primary quantity = 20 and secondary quantity = 10. Total = 20 + (10 * 15) = 170 cylinders allocated (primary plus up to 15 secondary extents). The control card dataset is used to pass the LIMIT key.
Formula to calculate space parameters:

Number of cylinders needed = (number of records * LRECL of dataset) / 849,960

- Records: 340,000
- LRECL: 214
- Bytes per cylinder (3390 standard): 849,960
- Number of cylinders: (340,000 * 214) / 849,960 ≈ 86 cylinders needed for 340,000 records
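The calculation above can be sketched in a few lines of Python (a hypothetical helper, using the 3390 figure of 849,960 usable bytes per cylinder):

```python
import math

BYTES_PER_CYLINDER = 849_960  # usable bytes per 3390 cylinder

def cylinders_needed(record_count: int, lrecl: int) -> int:
    """Return the number of cylinders needed to hold record_count records."""
    return math.ceil(record_count * lrecl / BYTES_PER_CYLINDER)

print(cylinders_needed(340_000, 214))  # ~86 cylinders for the MEM unload
```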
The problem was that business could not confirm the approximate day-to-day spike in the number of records, so finding a correct space parameter for all jobs was becoming difficult. And if we set the primary quantity too large in the JCL, MVS was unable to allocate extents, because other application jobs also required space in the same LPAR.
I had to coordinate with third-party storage teams to set the correct space parameters and rerun the jobs once they allocated an additional DASD volume. By the time we fixed the issue, the 4am SLA was breached and I had to explain to management and onshore.
The storage team was reluctant to permanently increase the DASD volume allocation for our jobs, as storage is at a premium, so they wanted justification; however, our business team was unable to confirm future spikes in records. So triage calls went on forever without any permanent fix.
S3 --> Lambda --> AWS Batch --> RDS
The source files can be stored in S3. Once they are pushed to S3, we can configure an S3 event notification to invoke a Lambda function. A Lambda function can only run for a maximum of 15 minutes, so we need AWS Batch for long-running SQL workloads; the Lambda function triggers an AWS Batch job to do the work.
As AWS Batch provisions and manages resources (EC2 + EBS storage), it scales accordingly, so we don't have any resource constraints or space issues. Once the batch processing is done, all allocated infrastructure is removed, saving cost.
Once the batch processing is over, the data can be pushed to AWS RDS for storage, which can be in a multi-AZ deployment with synchronous replication to a standby.
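The Lambda glue code could look roughly like this minimal sketch, assuming a hypothetical Batch job queue `mem-unload-queue` and job definition `mem-unload-job` (both names are illustrative, not from the original system):

```python
def parse_s3_event(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    return [
        (rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
        for rec in event.get("Records", [])
    ]

def handler(event, context):
    """Lambda entry point: submit one AWS Batch job per uploaded source file."""
    import boto3  # imported lazily so the pure parsing above is testable offline
    batch = boto3.client("batch")
    for bucket, key in parse_s3_event(event):
        batch.submit_job(
            jobName=f"mem-load-{key.replace('/', '-')}",
            jobQueue="mem-unload-queue",     # hypothetical queue name
            jobDefinition="mem-unload-job",  # hypothetical job definition
            parameters={"bucket": bucket, "key": key},
        )
```

The Batch job itself can then pull the file from S3 and run the long SQL workload without Lambda's 15-minute cap.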
2. LIMIT key partitioning
As the MEM table containing the consolidated member data of our application kept growing with new records, we had to manually partition the UNLOAD jobs using the MEM_ID field as the LIMIT key. This helped avoid some jobs consuming more resources than others.
Example - UNLOAD JOB 1:CONTROL CARD DATASET
LIMIT key set as MEM_ID.
- Before - Job 1 was doing all the unloading and consuming more resources.
- After - Jobs are evenly distributed to unload records based on new Limit keys.
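The rebalancing idea can be sketched as splitting the MEM_ID key space into near-even ranges, one per unload job (a simplified illustration, assuming numeric MEM_IDs; the real LIMIT keys were set manually in control cards):

```python
def limit_keys(min_id: int, max_id: int, jobs: int) -> list[tuple[int, int]]:
    """Split [min_id, max_id] into `jobs` near-even (low, high) LIMIT key ranges."""
    span = max_id - min_id + 1
    base, extra = divmod(span, jobs)
    ranges, low = [], min_id
    for i in range(jobs):
        # the first `extra` ranges absorb one leftover record each
        high = low + base - 1 + (1 if i < extra else 0)
        ranges.append((low, high))
        low = high + 1
    return ranges

# e.g. 4 unload jobs over MEM_IDs 1..1000
print(limit_keys(1, 1000, 4))  # [(1, 250), (251, 500), (501, 750), (751, 1000)]
```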
Amazon DynamoDB stores data in partitions. Partition management is handled entirely by DynamoDB—you never have to manage partitions yourself.
3. CPU - soft capping
There were several instances when the batch jobs ran slower and consumed higher elapsed times. On further investigation, we found that the mainframe performance & capacity team had enabled soft capping for our application jobs without informing us. This was done to provision more CPU time to other application jobs running in the same LPAR, and sometimes to avoid heating issues.
This did not cause any job abends; however, it delayed the entire process due to slow-running jobs, causing SLA breaches through lack of compute resources.
With cloud services we can auto-scale our infrastructure and automate capacity management, thus avoiding situations like these.
4. Fragmented regions
The 70-odd batch jobs were taking significantly longer unload times in pre-production regions. This was due to senior DBAs forgetting to perform a REORG on the base mainframe tables in those regions after production data migration. There were also discrepancies in the index definitions for tables in those regions compared to the production tables.
This scenario can be avoided in the cloud with the help of uniform CloudFormation templates deploying infrastructure as code (IaC) across regions.
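For example, a minimal CloudFormation fragment (resource and table names are illustrative) that can be deployed identically to every region, so table and index definitions can never drift between environments:

```yaml
# Hypothetical template fragment: deploying the same stack to each region
# guarantees identical table/index definitions everywhere.
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  MemTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: mem-table            # illustrative name
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: mem_id
          AttributeType: S
      KeySchema:
        - AttributeName: mem_id
          KeyType: HASH
```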
5. Access restrictions
While working for the offshore account, there were several occasions when business wanted me to query unrestricted production tables or deploy JCL changes in production to fix certain issues. However, my TSO ID did not have access to the tools (SPUFI, Changeman) needed to do that; this restriction applied across the entire offshore account. So I had to wait for my onshore counterpart to perform those changes, causing delays.
Granular IAM policies for READ/WRITE access to a resource ARN can be attached to a specific user or group.
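For illustration, a sketch of such a policy granting read/write on a single DynamoDB table (the table name, region, and account ID are hypothetical):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:Query",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/mem-table"
    }
  ]
}
```

Attaching this to an offshore group would have allowed exactly the production access business wanted, without granting anything else.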
6. Job scheduling

The mainframe batch jobs across regions were scheduled using the ASG-ZEKE tool. Schedule changes across regions were handled by a separate contractor, and we had to triage with them in case of any minor changes.
I found CI/CD pipelines and job scheduling easy to set up in GitHub Actions, and the way it implements workflows for interdependent jobs is easy to track.
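A minimal sketch of such a workflow, with job dependencies expressed via `needs:` (job names and steps are illustrative):

```yaml
# .github/workflows/nightly.yml — illustrative job names
name: nightly-cycle
on:
  schedule:
    - cron: "0 6 * * *"    # midnight CST (UTC-6)
jobs:
  unload:
    runs-on: ubuntu-latest
    steps:
      - run: echo "run unload step"
  load:
    needs: unload          # runs only after unload succeeds
    runs-on: ubuntu-latest
    steps:
      - run: echo "run load step"
```

The dependency graph is rendered directly in the Actions UI, which makes tracking inter-dependent jobs far easier than coordinating schedule changes through a separate contractor.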
7. Scheduled maintenance

Although mainframes are extremely robust machines, they do require maintenance. If any jobs are running during those windows, they are cancelled (S222 abend) by the datacenter when the LPAR is taken down. So we had to analyse the list of highly impacted jobs in advance and put them on hold. This was a tedious process, as maintenance timelines vary. We were also unable to query the tables needed for resolving incidents or fixing defects during those hours.
With distributed infrastructure across several regions, there is better hardware backup in case of failures or outages.
8. Database administration

- While working as a junior DBA in one of my earlier projects, I saw the friction between developers and DBAs. The DBAs usually have DDL/DML scripts which need to be managed on a regular basis after triaging with developers.
- In an RDBMS like DB2, an admin usually needs to configure the database; define the schema, tablespace, and field data types and lengths while creating a table; and then grant access to the application developers.
Sample DB2 SQL:

```sql
CREATE TABLE MYSCHEMA.MYTABLE (
    MEM_ID   CHAR(5)  NOT NULL,
    MEM_NAME CHAR(50) NOT NULL,
    EMAIL    CHAR(50) NOT NULL,
    UPD_DTE  DATE     NOT NULL,
    PRIMARY KEY (MEM_ID)
) IN DB1234.TS1234;

GRANT INSERT, UPDATE, SELECT, DELETE ON MYSCHEMA.MYTABLE TO GROUP;
GRANT INSERT, UPDATE, SELECT, DELETE ON MYSCHEMA.MYTABLE TO USER;
```
SELECT queries written against it must be optimized for better performance, which requires additional knowledge of the DB2 optimizer.
A NoSQL database like DynamoDB works primarily on key-value pairs, requires only a primary key, and doesn't require a schema to create a table. It removes the administrative burden of operating and scaling a distributed database. The table can be created using Boto3 in Python or a SAM template, so there is no need for separate DBAs to perform these tasks. However, DynamoDB is not an RDBMS and does NOT support SQL queries like this one:
SELECT * FROM MYTABLE WHERE EMAIL ='email@example.com';
To perform a similar operation in DynamoDB, we could run a scan on the entire table, which is costly and not recommended. The better way is to define a hash key on the mem_id attribute and create a GSI (Global Secondary Index) on the email attribute in the SAM template.
Then we can search the desired rows with Boto3 in python as per application requirement.
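A hedged sketch of that lookup with Boto3, assuming a hypothetical table `mem-table` with a GSI named `email-index` (both names are illustrative):

```python
def build_email_query(email: str, index_name: str = "email-index") -> dict:
    """Build the query arguments for an equality lookup on the email GSI."""
    return {
        "IndexName": index_name,
        "KeyConditionExpression": "email = :e",
        "ExpressionAttributeValues": {":e": {"S": email}},
    }

def find_member(email: str):
    """Query the hypothetical mem-table through its email GSI."""
    import boto3  # imported lazily so build_email_query is testable offline
    client = boto3.client("dynamodb")
    resp = client.query(TableName="mem-table", **build_email_query(email))
    return resp["Items"]
```

Unlike a scan, this query touches only the items matching the email key, so cost and latency stay flat as the table grows.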
I got these concepts from here and here.
NoSQL needs a different approach compared to RDBMS. There is also traditional RDBMS support in the cloud using RDS.
9. User Interface
The ISPF interface can be challenging to navigate for repetitive tasks. However, VS Code editor support was recently added for z/OS mainframes. z/OS only allows a single session per TSO user ID, so we cannot open multiple sessions across different terminals/devices.
In the cloud, I can use the editor of my choice for development. I can pass my IAM credentials as environment parameters while deploying code through a virtual machine, which blew my mind.