<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jaganadh Gopinadhan</title>
    <description>The latest articles on DEV Community by Jaganadh Gopinadhan (@jaganadhg).</description>
    <link>https://dev.to/jaganadhg</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F633870%2F6c9c3edb-54a5-425b-8778-f47127e5c092.png</url>
      <title>DEV Community: Jaganadh Gopinadhan</title>
      <link>https://dev.to/jaganadhg</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jaganadhg"/>
    <language>en</language>
    <item>
      <title>Remembering Kenneth Gonsalves (KG/lawgon)</title>
      <dc:creator>Jaganadh Gopinadhan</dc:creator>
      <pubDate>Fri, 05 Aug 2022 04:34:00 +0000</pubDate>
      <link>https://dev.to/jaganadhg/remembering-kenneth-gonsalves-kglawgon-133o</link>
      <guid>https://dev.to/jaganadhg/remembering-kenneth-gonsalves-kglawgon-133o</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kLW0PI0t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cgpdcbbbf6ks0lr1otyn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kLW0PI0t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cgpdcbbbf6ks0lr1otyn.jpg" alt="ILUG Coimbatore Meet in Amritha University KG Speaks" width="880" height="660"&gt;&lt;/a&gt;&lt;br&gt;
Kenneth Gonsalves (KG, alias lawgon) was one of my mentors. He was a unique personality and always helped me sharpen my Python programming skills. Along with him, I conducted more than a dozen Python workshops in the Coimbatore region of Tamil Nadu, India. It was shocking news when I received the call on Aug 3rd, 2012, about his sudden demise. He taught me many skills, such as open mapping with OpenStreetMap and how to work in an Open Source community. He was instrumental in most of my Open Source endeavors, such as reviving the Coimbatore Linux Users Group and the Chennai Python meetings (2008-2009). He played a crucial role in registering the Indian Python Software Society and worked with many brilliant minds in the Indian Python developer community to start the Indian Python Conference. If Kenneth were here today, he might have comments on today's conference formats, but he would have been very happy to see the growth. In 2022, I was attending training on 'Think Like a Lawyer,' and I could recollect Kenneth's advice on many of the scenarios we were trying to tackle. Any open source enthusiast who joined ILUG-Chennai (Indian Linux Users Group - Chennai) and posted even one e-mail would not have missed at least a reminder about how to write and respond to e-mails in forums. I followed it for most of my enterprise journey and eventually surrendered to the flow. Dropping the ball on continuing ILUG-CBE is still my greatest regret; after Kenneth's time, we moved to various parts of the country and became busy. I still have wild thoughts about reviving it. &lt;/p&gt;

&lt;p&gt;His passion for golf made him a referee, and he developed Python/Django-based software for managing golf play and events. The software was in use at the Ooty and Coimbatore golf clubs. The way he maintained documentation for the software was fantastic: people with little idea about how the software was built were able to install, configure, and troubleshoot it. I was called once to the Coimbatore golf club to perform a re-install, and it was a quick task because he had kept a note for the computer operator on how to do it. The software is still available in his GitHub repo; I am not sure anybody ever updated it. He was a well-known contributor to the Django community. &lt;/p&gt;

&lt;p&gt;During his MIT-College Chennai, AU-KBC, and NRC-FOSS days, he mapped the campus with his GPS and phone OSM software. The details of the campus were 100% accurate and up-to-date in OpenStreetMap. It was meticulous, precise work. &lt;/p&gt;

&lt;p&gt;He was a big fan of Fedora Linux and a critic of RedHat. When I acquired my first laptop in 2008, I installed Fedora following his path and advice (I donated that computer last year to Goodwill, still running Fedora 8!). We ran into the same issue every time: the BSNL 3G dongle would never connect on a new Fedora version. Compile, re-compile, IRC chats, RedHat criticism, and victory. &lt;/p&gt;

&lt;p&gt;Considering all his contributions, the Python Software Foundation (PSF) recognized him with a posthumous Community Service Award. A best speaker award in Kenneth's name was instituted at the Indian Python conference. &lt;/p&gt;

&lt;p&gt;He introduced my friends and me to many well-known figures in India. He was happy to bring me along as a teaching aid whenever there were Python training programs in and around Coimbatore, and he was very prompt in ensuring that I was remunerated along with him. The experience of training with him helped me get independent training programs and overcome some of my financial troubles during my Coimbatore days. Our similarity was that we both came from non-technical backgrounds and loved Open Source and Python. I was focused on analytics and machine learning, while KG was a master of many subject areas. Once there was a question on the ILUG-Chennai mailing list about using Open Source for the benefit of groundnut farming. Somebody asked, 'what is the relevance here?' KG's response: both have a kernel! One of the longest threads, which I remembered during COVID-19, was about the Raspberry Pi and public urine testing (yes! the thread was long before IoT and analytics were hot topics).&lt;/p&gt;

&lt;p&gt;It has been ten years without him for the people who knew him, worked with him, his close affiliates, and his family. The knowledge and experience he shared are precious. He remains my mentor and guide, and I believe he is with us. I still read his articles from Linux For You - &lt;a href="https://www.opensourceforu.com/author/kenneth-gonsalves/"&gt;https://www.opensourceforu.com/author/kenneth-gonsalves/&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Nanofabrication (Semiconductor) Wafer Thickness Prediction Data </title>
      <dc:creator>Jaganadh Gopinadhan</dc:creator>
      <pubDate>Fri, 25 Feb 2022 04:02:36 +0000</pubDate>
      <link>https://dev.to/aws-builders/nanofabrication-semiconductor-wafer-thickness-prediction-data-am</link>
      <guid>https://dev.to/aws-builders/nanofabrication-semiconductor-wafer-thickness-prediction-data-am</guid>
      <description>&lt;h3&gt;
  
  
  Open Data Sets in Micro-nano fabrication Part – II
&lt;/h3&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The manufacturing industry is always thriving on solving interesting problems with data and analytics solutions. Measuring the critical dimensions of materials on a product line is one such problem, and it becomes especially interesting when the measurements must be accurate at minute nm (nanometer) scales. The semiconductor industry faces exactly this problem when measuring thin-film thickness during the manufacturing cycle. It is a domain-centric, Intellectual Property (IP) driven area due to the competitive nature of the industry, with proven solutions and known leaders in the space such as KLA and Filmetrics. After many decades, a large open data set for this problem is available to the data science community for the first time. The lack of an open industry data set was never an innovation blocker; academia and industry have produced plenty of research material. Now, however, one has been opened up. The data will be beneficial in training process engineers and data analysts in semiconductor industry problem-solving. &lt;/p&gt;

&lt;h2&gt;
  
  
  Nanofabrication and Thickness
&lt;/h2&gt;

&lt;p&gt;Semiconductors are the backbone of modern digital infrastructure. The industry is powered by fundamental science: research in materials science, physics, chemistry, and mechanics. A fair understanding of this fundamental science (the domain knowledge) is required to solve Nanofabrication/Semiconductor AI/ML problems. In the natural course of evolution, Information Technology (IT), data and analytics, and Artificial Intelligence started complementing the field, and vice versa. &lt;/p&gt;

&lt;p&gt;The process of material modification is similar to the 'edit distance' concept in text processing: materials are selectively added, removed, or modified to create minute structures under controlled environments. Measuring the thickness of the various materials at each stage of this sequence is critical to improving manufacturing yield. The thicknesses we discuss here range from &amp;lt;1 nm to about 100 µm (roughly the thickness of a human hair). There are many processes in the nanofabrication industry, such as Physical Vapor Deposition (PVD, or sputtering)[4], etching, spin coating, etc. To achieve the target functionality of a semiconductor chip, maintaining accurate measurements (in nm) is a strict requirement; the industry refers to this under the blanket term 'yield improvement.' Once again, AI or ML alone is not a silver bullet here; a harmonious collaboration of material scientists, process engineers, and data analytics professionals is required. &lt;/p&gt;

&lt;p&gt;One of the key concepts from Physics to refresh to understand the data is reflectance. Reflectance is the fraction of incident light reflected from a surface, and it is an intrinsic property of thin films[5].  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--960SHJlI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uf32l4vl6m5w4tc6e2x6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--960SHJlI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uf32l4vl6m5w4tc6e2x6.png" alt="Reflection and refraction of light at a boundary between two media" width="698" height="649"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Image 1- Reflection and refraction of light at a boundary between two media[5]. &lt;/p&gt;
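
&lt;p&gt;As a quick refresher (an illustrative aside, not something from the data set): at normal incidence, the Fresnel equations give the reflectance of a single boundary directly from the two refractive indices. A minimal sketch, with assumed indices for air (n = 1.0) and SiO2 (n ~ 1.46):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def fresnel_reflectance(n1, n2):
    """Fraction of normally incident light reflected at an n1/n2 boundary."""
    return ((n1 - n2) / (n1 + n2)) ** 2

# Air over SiO2: roughly 3.5% of the incident light is reflected
r = fresnel_reflectance(1.0, 1.46)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;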

&lt;p&gt;The concept is critical to understanding the thickness data discussed here. Reflectance is covered in exemplary detail in reference [5] and in the metrology guide by Filmetrics [6], both worth reading for domain understanding. Optical spectrum analysis is one widely used method to measure thin-film thickness. It demands domain knowledge, computing resources, and subject matter expertise, and with an expert-in-the-loop process the measurement is relatively time-consuming. &lt;/p&gt;

&lt;h2&gt;
  
  
  Dacon Thin Film Thickness Data
&lt;/h2&gt;

&lt;p&gt;Dacon [1], a Korean data science competition platform similar to Kaggle, launched a 'Semiconductor thin film thickness analysis contest' [2]. Given the IP-centric nature of the domain, they abstracted the data while preserving the nature of the problem. The data is available on the Dacon competition page, subject to terms and conditions [2]. &lt;/p&gt;

&lt;h3&gt;
  
  
  Format
&lt;/h3&gt;

&lt;p&gt;The data was part of a competition, and the organizers provided it in two comma-separated value (CSV) files: a training file with the four target thickness measurements, and a test file without the targets. The training file consists of 810k observations with 230 attributes, four of which are the thickness measurements. The test data set, meant for final submission, has only 1k records. &lt;/p&gt;
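
&lt;p&gt;Given that column layout, separating the four targets from the masked features might look like the sketch below (the file name and exact column names are my assumptions; check the competition files):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

TARGETS = ["layer_1", "layer_2", "layer_3", "layer_4"]

def split_features_targets(df):
    """Separate the four thickness targets from the masked spectral columns."""
    return df.drop(columns=TARGETS), df[TARGETS]

# train = pd.read_csv("train.csv")  # hypothetical path to the Dacon training file
demo = pd.DataFrame({"layer_1": [10.0], "layer_2": [20.0],
                     "layer_3": [30.0], "layer_4": [40.0],
                     "0": [0.51], "1": [0.52]})
X, y = split_features_targets(demo)  # X: spectral columns, y: thickness targets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;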

&lt;h3&gt;
  
  
  Data Attributes
&lt;/h3&gt;

&lt;p&gt;The first four attributes measure the thickness of four layers, layer_1 through layer_4, in nm. The materials referenced by these measurements are Si3N4 (silicon nitride), SiO2 (silicon dioxide), Si3N4, and SiO2. The rest of the attributes are masked to protect IP; they are expressed as wavenumber (the reciprocal of wavelength). The column names are 0~255, corresponding to the range between 285 and 800 nm. A plot of a random record is provided below: the y-axis represents reflectance, and the x-axis the values 0~255 from the data. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HgdNtFDy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jvs04ky4kpq9qqgcpcym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HgdNtFDy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jvs04ky4kpq9qqgcpcym.png" alt="Thickness data sample" width="809" height="351"&gt;&lt;/a&gt;&lt;br&gt;
Image 2- Thickness data sample.   &lt;/p&gt;
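
&lt;p&gt;If the 256 masked columns are spaced evenly over the 285 to 800 nm range (an assumption on my part; the contest page does not state the spacing), recovering an approximate wavelength per column is straightforward:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# 256 spectral channels, assumed evenly spaced between 285 nm and 800 nm
wavelengths = np.linspace(285.0, 800.0, 256)

def column_to_wavelength(idx):
    """Map a masked column name (0~255) to its assumed wavelength in nm."""
    return wavelengths[int(idx)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;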

&lt;h3&gt;
  
  
  The Data Context
&lt;/h3&gt;

&lt;p&gt;Understanding the context of data, such as how it was generated and measured, is always essential in AI/ML experiments. In this data context, we are trying to predict the measurements of four layers. In process conditions, the bottom layer is always Si (silicon) and the top layer is air. To understand this better, let us find the theoretical reflectance of the four layers. I will use the tool available from Filmetrics [3] to get the theoretical reference range. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5cC_9gq0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ssbn45f5gohaeqwgq7qu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5cC_9gq0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ssbn45f5gohaeqwgq7qu.png" alt="Theoretical reflectance range for the Dacon data record data shown above" width="880" height="459"&gt;&lt;/a&gt;&lt;br&gt;
Image 3- Theoretical reflectance range for the Dacon data record data shown above &lt;/p&gt;

&lt;p&gt;Now we can spot the similarity between the plots. In my original plot, I reversed the 0~255 axis after referring to the Filmetrics reference plot. Since this data captures reflected light, it is essential to know the angle of incidence; in this data, it is 0 degrees. &lt;/p&gt;

&lt;h3&gt;
  
  
  Observation
&lt;/h3&gt;

&lt;p&gt;One of the interesting observations, after checking the Filmetrics plot, was that the data is reversed; the Dacon data was changed as part of the de-identification process. It may not be necessary to reverse it back for an ML experiment. To confirm the reversal, I plotted more than 25 randomly drawn samples from the training data and compared them with the Filmetrics plot. &lt;/p&gt;

&lt;h3&gt;
  
  
  Missing Information
&lt;/h3&gt;

&lt;p&gt;As part of IP protection, the team provided only the necessary details. Additional modeling information, such as the process details and the machinery used to collect the data, might have added more context. In a real-world situation, a machine learning professional would have access to such fine-grained context. &lt;/p&gt;

&lt;h2&gt;
  
  
  Predicting Thin Film Thickness
&lt;/h2&gt;

&lt;p&gt;Now that we understand the data, it is time to build some models. There is some room for feature engineering in this data; I am leaving that to the larger reader community, as my intent here is to introduce the data and the domain. I created a reference model as a starter and provide pointers to GitHub repositories with other implementations. &lt;/p&gt;

&lt;p&gt;We use a regression strategy called multi-output regression [7]. The problem is unique in that it is a regression case, but it predicts multiple targets. Both classical and Deep Learning methodologies are effective at multi-output regression. In my reference implementation, I used the scikit-learn (sklearn) MultiOutputRegressor API along with the XGBoost library. &lt;/p&gt;
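
&lt;p&gt;A minimal sketch of the strategy on synthetic data: scikit-learn's MultiOutputRegressor fits one regressor per target. A built-in gradient boosting regressor stands in here for XGBoost's XGBRegressor, which can be dropped in the same way.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 226))  # stand-in for the 226 masked spectral columns
y = rng.normal(size=(200, 4))    # stand-in for the four layer thicknesses

# One boosted regressor is fitted per thickness target
model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=20))
model.fit(X, y)
pred = model.predict(X)          # one prediction per layer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;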

&lt;p&gt;The notebook is available at - &lt;a href="https://github.com/jaganadhg/waferthickness"&gt;https://github.com/jaganadhg/waferthickness&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;If you are interested in solutions by competition participants, please refer to the following GitHub repositories. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/YoonSungLee/DACON-semiconductor-competition"&gt;https://github.com/YoonSungLee/DACON-semiconductor-competition&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pjhsk113/DACON-Semiconductor-Thinfilm-Analysis"&gt;https://github.com/pjhsk113/DACON-Semiconductor-Thinfilm-Analysis&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/popo97kr/DACON_semiconductor"&gt;https://github.com/popo97kr/DACON_semiconductor&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;I am looking forward to exciting findings and research papers from the data. &lt;/p&gt;

&lt;h2&gt;
  
  
  Acknowledgment
&lt;/h2&gt;

&lt;p&gt;I am grateful to the AWS Community Builders program for providing SageMaker and AWS credits. I used AWS SageMaker for the exploratory data analysis and machine learning experiments. I used Google Translate (Korean to English) to understand the data and studied many community posts in the Dacon competition forum. Some of the graphing insights derive from clarifications provided by the Dacon team and a user 'dodo'. I acknowledge the knowledge sharing of the Dacon user community, which helped me create this content. &lt;/p&gt;

&lt;h2&gt;
  
  
  Competing Interests
&lt;/h2&gt;

&lt;p&gt;The authors declare that no proprietary information related to the authors, affiliated company, or its approach, methodologies, and IPR is discussed in these notes. The authors declare that they have no competing interests. &lt;/p&gt;

&lt;h2&gt;
  
  
  How to Cite
&lt;/h2&gt;

&lt;p&gt;[*] Jaganadh Gopinadhan, Nanofabrication (Semiconductor) Wafer Thickness Prediction Data, Open Data Sets in Micro-nano fabrication Part – II.  &lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] &lt;a href="https://dacon.io/en"&gt;https://dacon.io/en&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;[2] &lt;a href="https://dacon.io/competitions/official/235554/overview/description/"&gt;https://dacon.io/competitions/official/235554/overview/description/&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;[3] &lt;a href="https://www.filmetrics.com/reflectance-calculator"&gt;https://www.filmetrics.com/reflectance-calculator&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;[4] &lt;a href="https://www.sciencedirect.com/topics/materials-science/sputter-deposition"&gt;https://www.sciencedirect.com/topics/materials-science/sputter-deposition&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;[5] Reflectance in Thin Films - &lt;a href="https://materion.com/-/media/files/advanced-materials-group/me/technicalpapers/reflectance-in-thin-films_all.pdf"&gt;https://materion.com/-/media/files/advanced-materials-group/me/technicalpapers/reflectance-in-thin-films_all.pdf&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;[6] Thin Film Measurements - &lt;a href="https://files.filmetrics.com/pdf/Filmetrics%20Tutorial%20-%20Thickness%20Metrology%20Guide%20v3N.pdf"&gt;https://files.filmetrics.com/pdf/Filmetrics%20Tutorial%20-%20Thickness%20Metrology%20Guide%20v3N.pdf&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;[7] Machine learning Refined - &lt;a href="https://www.cambridge.org/highereducation/books/machine-learning-refined/0A64B2370C2F7CE3ACF535835E9D7955#overview"&gt;https://www.cambridge.org/highereducation/books/machine-learning-refined/0A64B2370C2F7CE3ACF535835E9D7955#overview&lt;/a&gt;  &lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>sagemaker</category>
      <category>semiconductor</category>
    </item>
    <item>
      <title>Enterprise Machine Learning Best practices for AWS SageMaker </title>
      <dc:creator>Jaganadh Gopinadhan</dc:creator>
      <pubDate>Tue, 07 Sep 2021 05:23:36 +0000</pubDate>
      <link>https://dev.to/aws-builders/enterprise-machine-learning-best-practices-for-aws-sagemaker-47a0</link>
      <guid>https://dev.to/aws-builders/enterprise-machine-learning-best-practices-for-aws-sagemaker-47a0</guid>
      <description>&lt;p&gt;Most enterprises prefer cloud Data Science platforms. AWS SageMaker is an industry leader, and analysts recommended the Cloud Data Science platform. The platform offers state-of-the-art Machine Learning components, integrations, and MLOps. While enterprises are adopting SageMaker, it is essential to train our Data Scientists and ML Engineers on best practices. These best practices will help enterprises protect the data in motion (data in training) during the training phase and measure cost and ROI from the Data Science process localized to experiments and projects. This note will explore various SageMaker SDK options to secure, manage, and track experiments and inferences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tagging Training and Models
&lt;/h2&gt;

&lt;p&gt;Tagging is a critical piece of public cloud cost management, security, and resource management. To Data Scientists or ML Engineers, a tag may not sound like anything significant, but it is better to start adopting it in enterprise settings. Eventually, your team will be able to generate meaningful insights about the cost of training and inference from such a tagging process. The team may have to work closely with the Cloud Infrastructure and Operations (I&amp;amp;O) team to achieve this. &lt;/p&gt;

&lt;p&gt;A tag is a set of key-value pairs like JSON or Python dictionaries. A sample tag may look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Project: Our Awsome ML Magic
Sponsor: The Cool Manager
Project Lead: The AI Geek
Project Name: The Magic Wand
Cost Center: TheSuperCreditCardwithNoLimit
Contact: we@ourcoolcompany.ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's see how to tag a training job and a deployed model. First, we have to express our tags in the format the SageMaker SDK expects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;my_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'Project'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'Our Awsome ML Magic'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s"&gt;'Sponsor'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'The Cool Manager'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s"&gt;'Project Lead'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'The AI Geek'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s"&gt;'Project Name'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'The Magic Wand'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s"&gt;'Cost Center'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'TheSuperCreditCardwithNoLimit'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s"&gt;'Contact'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'we@ourcoolcompany.ai'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now assign the tag to the tag parameter in your estimator API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Estimator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;container_image_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;aws_role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_instance_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_instance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'ml.m5.xlarge'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_job_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"the-cool-ml-pipe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hyperparameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hyperparameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_spot_instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_wait&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_wait&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This process adds the tags to any AWS artifacts created by the training job. A training job spins up a VM/container, and the compute is charged to the AWS account; with tags, we can isolate the charge for each model experiment as well. Once the model is trained, we can use the deploy API and add the tags to the deployed model too. Yes, the deploy API has a &lt;code&gt;tags&lt;/code&gt; parameter for exactly this. &lt;/p&gt;
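
&lt;p&gt;For example, carrying the same tags onto the endpoint at deployment time might look like the sketch below (the instance type is a placeholder; pick one appropriate for your model).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Deploy the trained model and propagate the tags to the endpoint
predictor = est.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    tags=my_tags,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;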

&lt;h2&gt;
  
  
  Security Groups and Subnets
&lt;/h2&gt;

&lt;p&gt;Enterprise data and processes are precious and worth protecting by every means. When creating SageMaker experiments, it is advisable to specify subnets and security groups in the configuration to ward off any potential ravages hidden in the mysterious world. A security group filters incoming and outgoing traffic for AWS resources. The estimator API accepts the security groups and subnets as lists. Irrespective of the sensitivity of the data, it is advised to set &lt;code&gt;subnets&lt;/code&gt; and &lt;code&gt;security_group_ids&lt;/code&gt;. By default, AWS runs your training job in a separate VPC if none is specified; the steps discussed here are an additional measure of security. &lt;/p&gt;

&lt;h2&gt;
  
  
  Encrypt Container Traffic
&lt;/h2&gt;

&lt;p&gt;When using advanced algorithms such as Deep Learning, distributed training is inevitable, and during distributed training there will be inter-container traffic. It is better to encrypt this traffic; remember, it can slightly slow your training process, but the cost and delay introduced by encryption are worth the security of your data. The developer needs to set &lt;code&gt;encrypt_inter_container_traffic&lt;/code&gt; to &lt;code&gt;True&lt;/code&gt; (by default, it is &lt;code&gt;False&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Network Isolation of Containers
&lt;/h2&gt;

&lt;p&gt;During training or inference, we may not need internet access for any data. It is advised to store the data in an appropriate data store and reference it in the training script. By doing so, we avoid any internet traffic to our training containers. In the Estimator API, we can set the &lt;code&gt;enable_network_isolation&lt;/code&gt; parameter to &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encryption
&lt;/h2&gt;

&lt;p&gt;If the S3 buckets are encrypted with managed keys for additional security, we have to specify the keys in the &lt;code&gt;Estimator&lt;/code&gt; API; &lt;code&gt;volume_kms_key&lt;/code&gt; and &lt;code&gt;output_kms_key&lt;/code&gt; are the parameters to set. It is better to coordinate with your cloud team on policies and key usage, as every company has key management policies. Always remember: never expose your keys to the open internet! &lt;/p&gt;
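
&lt;p&gt;Pulling the security-related settings above together, a hardened training job might look like the following sketch (the subnet, security group, and KMS key values are placeholders; use the IDs your cloud team provides).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;est = sagemaker.estimator.Estimator(
    container_image_uri,
    aws_role,
    train_instance_count=1,
    train_instance_type="ml.m5.xlarge",
    subnets=["subnet-0123456789abcdef0"],         # placeholder subnet ID
    security_group_ids=["sg-0123456789abcdef0"],  # placeholder security group
    encrypt_inter_container_traffic=True,         # encrypt distributed-training traffic
    enable_network_isolation=True,                # no internet access from containers
    volume_kms_key="alias/my-volume-key",         # placeholder KMS keys
    output_kms_key="alias/my-output-key",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;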

&lt;h2&gt;
  
  
  Cost Optimization
&lt;/h2&gt;

&lt;p&gt;SageMaker provides managed spot instances for warm-start training and cost optimization. Enabling spot instances is good practice; training is an iterative process, so we can save some $$ for our SOTA models ;-). Details and examples of how to use spot instances in SageMaker are available at &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html"&gt;https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  MLOps Innovation
&lt;/h2&gt;

&lt;p&gt;These features may not resonate with Data Scientists or ML Engineers, as they relate more to infrastructure. An innovative MLOps team can create automation or homegrown libraries to support Data Scientists and ML Engineers by simplifying these boring tasks.&lt;/p&gt;

&lt;p&gt;Happy Hacking!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>sagemaker</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
