DEV Community: Ata Seren

Creating a DevSecOps pipeline with Jenkins — Part 1

Ata Seren — Sun, 17 Mar 2024 13:22:02 +0000

Hello everyone. I’m Ata, a computer science graduate and currently interested in cybersecurity. In the fast-paced world of cybersecurity, staying ahead of potential threats is crucial. As a junior cybersecurity engineer with a specific interest in DevSecOps, I am excited to share my journey in creating a robust DevSecOps pipeline by using Jenkins and various tools.

I will share my journey in 3 parts since there is too much material and reading them in a single story may be difficult and take a lot of time. In this part, I will talk about DevSecOps pipelines, Jenkins and its installation, SAST and some details about them.

What are DevSecOps Pipelines?

DevSecOps pipelines are similar to DevOps pipelines. They are like assembly lines for building software, but with a big focus on security. These pipelines automate different tasks, like writing code, testing it, putting it into action, and keeping an eye on it afterward. At each step, they also check for security problems, making sure everything stays safe from the start to the end of the process. This way, everyone involved works together to make sure the software is secure right from the beginning.

Jenkins

Jenkins is an open-source automation server widely used for automating various aspects of the software development process, including building, testing, and deploying software. It provides a platform for continuous integration (CI) and continuous delivery (CD), allowing developers to integrate code changes into a shared repository frequently.

Download and Installation

First of all, let’s set up a Jenkins instance on our machine. Since the installation is simple and running Jenkins in a container can cost some performance and storage, I installed Jenkins directly on my virtual machine. I’m using Ubuntu 22.04 on VMware Workstation Pro with 16GB of RAM, 100GB of storage and 8 processor cores. These specs are definitely not a limit or requirement.

For the installation of Jenkins, I simply followed the Ubuntu/Debian section under the Linux section at this link: https://www.jenkins.io/doc/book/installing/. It takes you through the installation of both Jenkins and Java, since Jenkins requires Java to run. You can also find minimum and recommended specs, an installation guide, troubleshooting tips and much other useful information about Jenkins. It is very easy to follow but let me give you a spoiler:

Assuming you have Java installed on your machine, simply run these commands:

sudo wget -O /usr/share/keyrings/jenkins-keyring.asc \
  https://pkg.jenkins.io/debian/jenkins.io-2023.key
echo "deb [signed-by=/usr/share/keyrings/jenkins-keyring.asc]" \
  https://pkg.jenkins.io/debian binary/ | sudo tee \
  /etc/apt/sources.list.d/jenkins.list > /dev/null
sudo apt-get update
sudo apt-get install jenkins

At the end of the installation, you can see that these commands have:

Set up Jenkins as a daemon launched on start,
Created a ‘jenkins’ user to run this service,
Directed console log output to systemd-journald,
Populated /lib/systemd/system/jenkins.service with configuration parameters for the launch, e.g. JENKINS_HOME, and most importantly,
Set Jenkins to listen on port 8080.

Configuration

After the installation is complete, open Jenkins in your web browser. If you didn’t do any custom configuration, it is probably at http://localhost:8080. This page will greet you:

From the given path, retrieve the initial password and log in to Jenkins. You may need root access to read the file at that path.

Choose “Install suggested plugins”. Later, we will install the plugins required for each step of our pipeline one by one, from the “Available plugins” section:

After the plugin installation, you are asked to create an admin user for Jenkins. You can choose the “Skip and continue as admin” option, but then you have to keep using the initial password to access Jenkins. Therefore, I suggest creating a user. For convenience, you can use the same credentials as your machine’s:

You can directly choose “Save and Finish” option since our Jenkins is set up locally instead of on a cloud system and we will conduct all scans locally:

If you are seeing this page, congrats! You have successfully installed Jenkins to your system:

Creating a Pipeline

By clicking on “New Item”, you will get to this page where Jenkins provides many options. On this page, we will choose the “Pipeline” option. With this option, we will create our pipeline from scratch by coding a Jenkinsfile. The “Freestyle project” option provides a friendlier UI and an easier way to create a pipeline. However, some steps do not fit easily in Freestyle, and once you learn how to create a pipeline from scratch, the other options become easy to pick up since this is the base of Jenkins pipelines:

On the page that opens, you will see the general options for your project. You can leave them as they are for now, since our purpose is to learn the basics of a security scan and we will test a project that is intentionally vulnerable and isn’t part of an ongoing development process. This is the section in which we will create the pipeline from scratch:

Step-by-step Pipeline Development

Now, let’s have an overview of stages we will use in our security scanning pipeline:

Checkout: Cloning the project from a Git repository (this stage may vary according to our way to access the project).
Build: Building the project since some steps require a built project.
SAST: Static Application Security Testing analyzes the source code and the built project without running the application.
Dependency Check: Performs a scan on dependencies of a project and reports if there are any vulnerabilities on them.
SBOM: Software Bill Of Materials is a list of components in a piece of software. SBOMs are useful to determine used technologies in a project and can be used for additional security scans.
SCA: Software Composition Analysis can be used to automate the identification of vulnerabilities in entire container images, packaged binary files and source code.
Git Secrets Detection: Can prevent accidentally committing code that contains a secret.
Container Security: If there is a container of the project, you can test everything on the container.
DAST: Dynamic Application Security Testing approaches security from the outside. It requires applications to be fully compiled and operational to identify network, system and OS vulnerabilities.

These are the stages that we will implement in our pipeline. Jenkins pipelines use a file type called _Jenkinsfile_. It’s probably different from the file types you have used before, but it has a simple syntax. For example, this is a single-stage Hello World pipeline script:

pipeline {  
    agent any  

    stages {  
        stage('Hello') {  
            steps {  
                echo 'Hello World'  
            }  
        }  
    }  
}

Jenkins allows you to configure many settings according to your needs. However, these parts are the only required ones for a pipeline. Throughout the development of the pipeline, we will use additional parts and syntax rules for tools we will run in stages. For the additional settings and syntax rules you can visit https://www.jenkins.io/doc/book/pipeline/syntax/.

Tip: To add a setting, feature or section, you can use Pipeline Syntax located under Pipeline section:

In Pipeline Syntax, there is useful information about the syntax and, most importantly, 2 tools that will assist you: Snippet Generator and Declarative Directive Generator. These tools help you generate Jenkinsfile code according to your needs and the parameters you give.

During my learning process, I ran a build for every step to make sure that the code is correct in terms of syntax and configuration. I will follow the same approach throughout this story, both to show it to you better and to remind myself of the process.

Checkout

First of all, let’s pull our project we want to scan from a Git repository. For our example project, I chose **Vulnado — Intentionally Vulnerable Java Application**. Here is its link: https://github.com/ScaleSec/vulnado

Every time we start the pipeline, or a trigger such as a push or a PR to the repository starts it, the project will be pulled from the repository. We will use **Git** to perform the pull. The Jenkins plugin for Git is installed by default. However, we still need to have Git installed on our machine. To do this, simply run this command to install the Git binary:

sudo apt-get install git

After this install, you need to invoke the binary from Jenkins. For this, you can simply add this step to the pipeline:

git 'https://github.com/ScaleSec/vulnado.git'

This step uses the Git plugin for Jenkins to interact with the Git binary on our machine. We will usually interact with tools and programs via such plugins to have a stable and efficient pipeline.

This is our code at the end of this step:

pipeline {  
    agent any  

    stages {  
        stage('Checkout') {  
            steps {  
                git 'https://github.com/ScaleSec/vulnado.git'  
            }  
        }  
    }  
}

Note that we created a stage under “stages” and in this stage, we added the command under “steps”. We will follow the same approach for each stage and commands we need to run.

This is the pipeline dashboard after our first build:

Build

In DevOps and DevSecOps pipelines, building is a required stage to check that changes to the code didn’t break the program. If the project is built successfully, the testing stage starts. If the tests pass, the project is finally deployed in a deployment stage. In this DevSecOps pipeline, we are aiming to perform security scans and find vulnerabilities. In other words, we are doing the testing part of a traditional pipeline. Therefore, we don’t need to bother with building and deploying, yet.
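The build, test, deploy flow described above is fail-fast: a stage only runs if the previous one succeeded. Here is a minimal Python sketch of that idea. The name `run_stages` and the stage callables are hypothetical and only model the behaviour; this is not how Jenkins itself is implemented.

```python
from typing import Callable, List, Tuple

# Hypothetical model of fail-fast stages; not part of Jenkins.
Stage = Tuple[str, Callable[[], bool]]


def run_stages(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that fails.

    Returns the names of the stages that were actually executed.
    """
    executed = []
    for name, action in stages:
        executed.append(name)
        if not action():
            break  # later stages are skipped, as in a failed pipeline run
    return executed


demo = [
    ("Build", lambda: True),
    ("Test", lambda: False),  # simulate a failing test stage
    ("Deploy", lambda: True),
]
print(run_stages(demo))
```

In this demo the Deploy stage never runs because the simulated Test stage fails.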

Almost every tool in a DevSecOps pipeline requires the project to be built in order to scan it efficiently. Some of them build it themselves, some of them don’t. Therefore, I usually add a stage to build the project since it can be useful and it is easy to add to the pipeline.

This project is a Maven project. Therefore, I use a Maven command to build it. Depending on your project, you can use a different build command in the pipeline:

mvn clean package

Note that you may need to install some tools to build your project. For example, I needed to install Java and Maven on my machine since the project I scan is a Maven project.

This is the simple command I run in my terminal to build my project, and that’s what we should do in the pipeline too. To run a shell command in the pipeline, wrap the command in single quotes and put “sh” in front of it. This will be the step to run in the Jenkins pipeline: sh 'mvn clean package'

This is the final code and result of running the pipeline:

pipeline {  
    agent any  

    stages {  
        stage('Checkout') {  
            steps {  
                git 'https://github.com/ScaleSec/vulnado.git'  
            }  
        }  
        stage('Build') {  
            steps {  
                sh 'mvn clean package'  
            }  
        }  
    }  
}

SAST

Here comes the hard part. The previous steps did not require complex tools and processes. We just pulled and built the project. But from now on, we need to use various tools, install them on our machine and configure both the tools and Jenkins itself.

For the SAST stage, I used SonarQube. SonarQube is an open-source platform developed by SonarSource for continuous inspection of code quality. It performs automatic reviews with static analysis of code to detect bugs and code smells in more than 30 programming languages. I preferred SonarQube over other SAST tools because it has detailed documentation and plugins for integration with Jenkins, and it works pretty well with Java projects. Of course, you can use similar multi-language tools such as Semgrep or language-specific tools such as Bandit.

Let’s start with the simplest one. We need to download the plugin on Jenkins for SonarQube to connect SonarQube and Jenkins. On the main dashboard of Jenkins, go to Manage Jenkins > Plugins > Available plugins and type “sonarqube”. SonarQube Scanner plugin should appear on the top:

Mark the box next to the plugin and install the plugin. On the installation page, mark the box “Restart Jenkins when installation is complete and no jobs are running” since you need to restart Jenkins eventually to make the plugin available.

Obviously, we must install an instance of SonarQube. You can use a machine on a cloud provider if you want, but for simplicity, we will install a local instance. You can use either a Docker image or the zip file to run SonarQube. Both ways are simple and compatible with Jenkins and this story. Here is the link with the steps to follow for installation: https://docs.sonarsource.com/sonarqube/latest/try-out-sonarqube/. The steps on this page run a simple instance of SonarQube in a Docker container. If you are curious about it, you can follow the zip file version. Both of them have the same abilities, I just didn’t want to bore you with specific volume and network settings.

You can confirm that the installation is successful and ready to be used with Jenkins by opening http://localhost:9000/ (this is the default URL for SonarQube if you didn’t make any custom configuration). This is the first page that will greet you. The default username and password are both “admin”. It will ask you to change your password to continue.
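If you prefer to check from a script instead of the browser, a rough Python sketch could look like the one below. It assumes the default local URL and SonarQube's `/api/system/status` web API endpoint, which reports a `status` field. The helper names `parse_status` and `sonarqube_is_up` are made up for this example.

```python
import json
import urllib.request


def parse_status(payload: str) -> bool:
    """Return True when a status response body reports "UP"."""
    return json.loads(payload).get("status") == "UP"


def sonarqube_is_up(base_url: str = "http://localhost:9000") -> bool:
    """Query the status endpoint; any connection or parsing error counts as not ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/system/status", timeout=5) as resp:
            return parse_status(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        return False
```

Calling `sonarqube_is_up()` right after starting the container may return False for a while, since the server needs some time to boot.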

This is the dashboard of SonarQube. Since we don’t have any projects created, it offers us a choice of DevOps platforms. Usually, GitHub is selected, but we won’t follow that path because choosing GitHub requires us to have a GitHub App. GitHub Apps are tools that extend GitHub’s functionality, such as opening issues, commenting on pull requests, and managing projects. They can also do things outside of GitHub based on events that happen on GitHub:

Using such an app may be useful for organizations or frequently developed projects. However, this story’s scope is performing a security test on a project, and creating a GitHub App and integrating it with both SonarQube and Jenkins is difficult and unnecessary for a testing pipeline. Because of this, I want you to choose “Create a local project”. Then, give a name to your project. A key will be suggested to you but any value is okay.

After this page, you can simply choose “Use the global setting” and create your project. It will ask you to choose an analysis method. Choose “With Jenkins”.

On the next page, choose GitHub for “Select your DevOps platform” and it will show you some steps. We have already completed some of them, but there are also some Jenkins and SonarQube integration steps that are not mentioned there.

For the Jenkins integration steps, there is a document on SonarQube’s website. However, in my opinion, SonarQube’s documentation is incomplete and sometimes even wrong. Therefore, I will show you the steps that worked for me.

On your Jenkins dashboard, go to Manage Jenkins > Credentials. In this page, click on System > Global credentials (unrestricted) > Add credentials. In here:

Kind must be Secret Text
Scope must be Global
Secret must be the token generated on SonarQube. You can easily generate it at User > My Account > Security in SonarQube. Generate a Global Analysis Token, then copy and paste it here.

Now, from the Jenkins dashboard again, go to Manage Jenkins > System > SonarQube servers. Here, enter a name for your installation, the server URL (which is http://localhost:9000 by default) and the server authentication token. This is the credential we just added. You should be able to see it in the dropdown menu. Choose it and save:

Now, the final part. On SonarQube, you can see a script that you can add to your pipeline. However, it needs some adjustments and a fix to work properly. First of all, we don’t need the def mvn = tool 'Default Maven'; part because our Maven is on the PATH. Because of this, we should change the shell command to: sh "mvn clean verify sonar:sonar -Dsonar.projectKey=vulnado -Dsonar.projectName='vulnado'" Finally, we must include an installation name in the script. It is not mentioned on SonarQube but it is necessary. The installation name is the name you entered in the previous step. For example, my installation name is “sonar-local” and here is that part of the code:

withSonarQubeEnv(installationName: 'sonar-local')

pipeline {  
    agent any  

    stages {  
        stage('Checkout') {  
            steps {  
                git 'https://github.com/ScaleSec/vulnado.git'  
            }  
        }  
        stage('Build') {  
            steps {  
                sh 'mvn clean package'  
            }  
        }  
        stage('SonarQube Analysis') {  
            steps{  
                withSonarQubeEnv(installationName: 'sonar-local') {  
                  sh "mvn clean verify sonar:sonar -Dsonar.projectKey=vulnado -Dsonar.projectName='vulnado'"  
                }  
            }  
        }  
    }  
}

This is the result of the build:

The “SonarQube” button simply redirects you to your SonarQube URL. This is the SonarQube page after SAST stage:

Well, this is the end of Part 1. I will upload Part 2 ASAP. In some parts, I didn’t add images to keep the story short. However, I’d love to hear your comments about the format and content of this story, so I can improve the following parts.

Thanks for reading. I hope it helps!

LeetCode Study Plan: 30 Days of Pandas

Ata Seren — Sat, 16 Mar 2024 21:37:20 +0000

Hello everyone. I’m Ata, a computer science graduate and currently interested in cybersecurity. I haven’t used LeetCode a lot since college. But recently, I wanted to hone my coding skills and learn new concepts. To achieve this, I wanted to use “study plans” of LeetCode.

What is Study Plan?

LeetCode Study Plans are plans that consist of LeetCode problems scheduled and categorized. These study plans and problems in them can be specific to some areas such as JavaScript, SQL, etc. or problems that are chosen for training for code interviews. These plans are also split into time schedules to ease the solving process.

30 Days of Pandas Study Plan

This study plan on LeetCode covers the essential topics that are often asked in Pandas interviews. It consists of 32 questions. Therefore, you can schedule the questions for every day of a month. I have used Pandas a few times, mostly for machine learning projects. Now, I want to delve into more features of Pandas and improve my knowledge on another computer science topic.

From the thumbnail, you can see that I only solved 28 problems because the other 4 questions were available only to LeetCode Premium subscribers. If I subscribe to it one day, I’ll add those questions, too.

What Will You Find In This Post? (TLDR Part)

During my solving process, I took some notes to reinforce my understanding of Pandas functions and their capabilities. This story is written according to these notes.

It isn’t much but I wanted to share these notes because I want to write more stories about different topics and I wanted to make a small step with these. Also, I studied the problems while writing this story. I hope it will be beneficial for people who are interested in Pandas or coding problems.

In this story, I’ll name the questions, provide my answers, and explain the functions used in the solutions.

Note: I won’t delve into all the details of the problems, as they are available in a better format and with examples on LeetCode.

Big Countries

Q: A country is big if:

it has an area of at least three million (i.e., 3000000 km²), or
it has a population of at least twenty-five million (i.e., 25000000).

Write a solution to find the name, population, and area of the big countries. Return the result table in any order.

def big_countries(world: pd.DataFrame) -> pd.DataFrame:
    return world.loc[(world['area'] >= 3000000) | (world['population'] >= 25000000), ['name', 'population', 'area']]

In this question, df.loc is used to locate the desired entries. The most important part is the condition passed to it: df.loc accepts a boolean condition for the rows and a list of column names, which makes it very useful for filtering entries.
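To see what the condition does on its own, here is a small sketch with made-up rows (the country names and numbers are invented for illustration). Each comparison produces a boolean Series, and `|` combines them element-wise before `.loc` uses the result as a row filter.

```python
import pandas as pd

# Made-up rows for illustration only.
world = pd.DataFrame({
    "name": ["Aland", "Bigland", "Crowdia"],
    "area": [1_000, 5_000_000, 200_000],
    "population": [50_000, 1_000_000, 40_000_000],
})

# Each comparison yields a boolean Series; "|" is an element-wise OR.
mask = (world["area"] >= 3_000_000) | (world["population"] >= 25_000_000)

# .loc[row_mask, column_list] keeps the matching rows and the listed columns.
big = world.loc[mask, ["name", "population", "area"]]
print(big)
```

The parentheses around each comparison matter, because `|` binds more tightly than `>=` in Python.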

Recyclable and Low Fat Products

Q: Write a solution to find the IDs of products that are both low fat and recyclable. Return the result table in any order.

def find_products(products: pd.DataFrame) -> pd.DataFrame:
    return products.loc[(products['low_fats'] == 'Y') & (products['recyclable'] == 'Y'), ['product_id']]

Like the previous question, I used conditions in df.loc function.

Customers Who Never Order

Q: Write a solution to find all customers who never order anything. Return the result table in any order.

def find_customers(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    result = customers[~customers['id'].isin(orders['customerId'])]
    return result[['name']].rename(columns={'name': 'Customers'})

This time, we have a new function called Series.isin and the tilde character “~”. isin checks, for each element of the Series, whether it is contained in the values passed to the function. The tilde inverts the resulting boolean values. In this question, the tilde is used to select the customers whose id does not appear among the customers that placed an order.
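A tiny sketch with invented data shows the two pieces separately: first the boolean Series that `isin` returns, then its inversion with `~`.

```python
import pandas as pd

# Invented sample tables for illustration.
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"id": [10, 11], "customerId": [1, 3]})

# True where the customer's id appears in the orders table.
has_ordered = customers["id"].isin(orders["customerId"])

# "~" flips every boolean, leaving only customers without any order.
never_ordered = customers[~has_ordered]
print(never_ordered)
```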

Article Views I

Q: Write a solution to find all the authors that viewed at least one of their own articles. Return the result table sorted by id in ascending order.

def article_views(views: pd.DataFrame) -> pd.DataFrame:
    result = views.loc[views['author_id'] == views['viewer_id'], ['author_id']].sort_values(['author_id'], ascending = True).drop_duplicates().rename(columns={'author_id': 'id'})
    return result[["id"]]

In this question, other than the simple and usual functions, I used df.sort_values to meet the order requirements of the question. You can also see that I used the chaining method I mentioned in my previous article to keep the code compact and avoid intermediate variables.

Invalid Tweets

Q: Write a solution to find the IDs of the invalid tweets. The tweet is invalid if the number of characters used in the content of the tweet is strictly greater than 15. Return the result table in any order.

def invalid_tweets(tweets: pd.DataFrame) -> pd.DataFrame:
    return tweets.loc[tweets['content'].str.len() > 15, ['tweet_id']]

Unlike the other similar questions, here I used Series.str.len(). This function computes the length of each element in the Series/Index. I used it to meet the question’s requirements.

Calculate Special Bonus

Q: Write a solution to calculate the bonus of each employee. The bonus of an employee is 100% of their salary if the ID of the employee is an odd number and the employee's name does not start with the character 'M'. The bonus of an employee is 0 otherwise. Return the result table ordered by employee_id.

def calculate_special_bonus(employees: pd.DataFrame) -> pd.DataFrame:
    employees['bonus'] = 0

    employees.loc[(employees['employee_id'] % 2 != 0) & (~employees['name'].str.lower().str.startswith('m')), 'bonus'] = employees['salary']

    return employees[['employee_id','bonus']].sort_values('employee_id')

First of all, I set all bonuses to 0, so that I only need to change the bonuses that must be changed and the rest keep the value 0.

In this question, I used a converter function for the first time. Series.str.lower() function makes a string lowercase. I added this function to the chain because one of the test cases had a character ‘m’ instead of ‘M’. Therefore, I wanted to check for both lowercase and uppercase ‘m’. With Series.str.startswith(char), I checked the first character of the employee names and calculated bonuses according to that.
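Here is a short sketch of the same condition on invented rows, including a lowercase name to show why the `str.lower()` step matters. The data is made up and not taken from the LeetCode test cases.

```python
import pandas as pd

# Invented rows; "michael" is lowercase on purpose.
employees = pd.DataFrame({
    "employee_id": [2, 3, 7, 8, 9],
    "name": ["Meir", "michael", "Addilyn", "Juan", "Kannon"],
    "salary": [3000, 3800, 7400, 6100, 7700],
})

odd_id = employees["employee_id"] % 2 != 0
# Lowercasing first makes the prefix check case-insensitive.
starts_with_m = employees["name"].str.lower().str.startswith("m")

employees["bonus"] = 0
employees.loc[odd_id & ~starts_with_m, "bonus"] = employees["salary"]
print(employees[["employee_id", "bonus"]])
```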

Fix Names in a Table

Q: Write a solution to fix the names so that only the first character is uppercase and the rest are lowercase. Return the result table ordered by user_id.

def fix_names(users: pd.DataFrame) -> pd.DataFrame:
    users.name = users.name.str.lower().str.capitalize()

    return users[['user_id', 'name']].sort_values('user_id')

This is a simple one. First, I converted all words to lowercase with Series.str.lower() and capitalized the first letter with Series.str.capitalize().
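As a side note, `Series.str.capitalize()` already lowercases everything after the first character, so the extra `str.lower()` call is harmless but should not be required. A quick sketch with invented names:

```python
import pandas as pd

names = pd.Series(["aLice", "bOB"])

with_lower = names.str.lower().str.capitalize()
without_lower = names.str.capitalize()
print(with_lower.tolist(), without_lower.tolist())
```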

Find Users With Valid E-Mails

Q: Write a solution to find the users who have valid emails.

A valid e-mail has a prefix name and a domain where:

The prefix name is a string that may contain letters (upper or lower case), digits, underscore '_', period '.', and/or dash '-'. The prefix name must start with a letter.
The domain is '@leetcode.com'.
Return the result table in any order.

def valid_emails(users: pd.DataFrame) -> pd.DataFrame:
    return users.loc[users['mail'].str.contains(r'^[a-zA-Z][a-zA-Z0-9_.-]*@leetcode\.com$')]

To be honest, I got some help from ChatGPT 😃. From the question, I realized that I need to check whether the values fit a pattern. Regular expressions (regex) are the best way to do it efficiently. To do the comparison, I looked for a function and found that Series.str.contains() accepts regex, along with other types of parameters. After that, I asked ChatGPT to generate a regex to meet the valid email requirements and used it in this function.
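To get a feel for what the pattern accepts and rejects, here is a sketch that runs it over a few invented addresses. `^` and `$` anchor the match to the whole string, the first character class forces a leading letter, and `\.` matches a literal dot.

```python
import pandas as pd

PATTERN = r"^[a-zA-Z][a-zA-Z0-9_.-]*@leetcode\.com$"

# Invented addresses for illustration.
users = pd.DataFrame({"mail": [
    "winston@leetcode.com",     # valid
    "jonathanisgreat",          # no domain
    ".shapo@leetcode.com",      # prefix does not start with a letter
    "sally.come@leetcode.com",  # valid
    "bella-@leetcode?com",      # "?" instead of a literal dot
]})

matches = users["mail"].str.contains(PATTERN)
print(users[matches])
```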

Patients With a Condition

Q: Write a solution to find the patient_id, patient_name, and conditions of the patients who have Type I Diabetes. Type I Diabetes always starts with DIAB1 prefix. Return the result table in any order.

def find_patients(patients: pd.DataFrame) -> pd.DataFrame:
    return patients.loc[patients['conditions'].str.contains(r'(^|\s)DIAB1')]

Again, there is a question with regex. But I realized that I need to use regex after my first submission for this problem. I thought that I just needed to find a single word. Therefore, I simply used "DIAB1" with Series.str.contains(). However, in one of the test cases, there is a word “SADIAB1” that returns true for the function but not the word that question asks for. Therefore, I converted it to regex and added “(^|\s)” part which means that this word is at the beginning or there is a space before that.
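The difference between the plain substring search and the anchored regex can be seen in this sketch with invented condition strings. It uses a non-capturing group `(?:^|\s)`, which matches the same text as `(^|\s)` but avoids the warning pandas emits when a pattern passed to `str.contains` has capture groups.

```python
import pandas as pd

# Invented condition strings for illustration.
conditions = pd.Series(["DIAB100 MYOP", "ACNE DIAB100", "SADIAB100", "YFEV COUGH"])

# Plain substring search also matches "SADIAB100".
naive = conditions.str.contains("DIAB1")

# Require the start of the string or a whitespace character before DIAB1.
anchored = conditions.str.contains(r"(?:^|\s)DIAB1")
print(naive.tolist(), anchored.tolist())
```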

Nth Highest Salary

Q: Write a solution to find the nth highest salary from the Employee table. If there is no nth highest salary, return null.

def nth_highest_salary(employee: pd.DataFrame, N: int) -> pd.DataFrame:
    try:
        if N < 1:
            raise ValueError("Manually raising a ValueError")
        salary = employee.drop_duplicates(subset=['salary']).sort_values(by=['salary'], ascending = False).reset_index(drop=True).iloc[N-1]["salary"].astype(int)
        return pd.DataFrame([salary], columns=[f'getNthHighestSalary({N})'])
    except (IndexError, ValueError):
        return pd.DataFrame([np.nan], columns=[f'getNthHighestSalary({N})'])

In this question, I used a try-except block for some test cases. There are 2 ways the test cases can cause a problem: the DataFrame has fewer than n distinct salaries, or n is smaller than 1. I used the try-except block and an if statement to handle both, and used NumPy to put a null value in the returned DataFrame to match the test cases.

In the try-except block, I sorted the entries by ‘salary’ value. Then, I located the Nth highest salary with df.iloc function.
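The intermediate steps are easier to follow on a small invented table. This sketch also shows the two edge cases: an out-of-range position raises IndexError, while a negative position does not raise at all, because `iloc[-1]` silently returns the last row. That is why a separate check for N < 1 is useful.

```python
import pandas as pd

# Invented salaries; 300 appears twice.
employee = pd.DataFrame({"id": [1, 2, 3, 4], "salary": [100, 300, 300, 200]})

distinct_desc = (
    employee.drop_duplicates(subset=["salary"])
    .sort_values(by="salary", ascending=False)
    .reset_index(drop=True)
)

# Position 1 holds the second highest distinct salary.
second = int(distinct_desc.iloc[1]["salary"])

# N = 0 would become iloc[-1], which returns the last row instead of failing.
last_row_salary = int(distinct_desc.iloc[-1]["salary"])

try:
    distinct_desc.iloc[3]  # only three distinct salaries exist
    out_of_range = False
except IndexError:
    out_of_range = True

print(second, last_row_salary, out_of_range)
```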

Second Highest Salary

Q: Write a solution to find the second highest salary from the Employee table. If there is no second highest salary, return null (return None in Pandas).

def second_highest_salary(employee: pd.DataFrame) -> pd.DataFrame:
    try:
        salary = employee.drop_duplicates(subset=['salary']).sort_values(by=['salary'], ascending = False).reset_index(drop=True).iloc[1]["salary"].astype(int)
        return pd.DataFrame([salary], columns=['SecondHighestSalary'])
    except:
       return pd.DataFrame([np.nan], columns=['SecondHighestSalary'])

This question is similar to the previous one. Instead of an arbitrary value, n is 2 in all cases. Therefore, I just need the same logic with try-except in case there is no 2nd highest salary in the DataFrame.

Department Highest Salary

Q: Write a solution to find employees who have the highest salary in each of the departments. Return the result table in any order.

def department_highest_salary(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
    max_salary_by_department = employee.groupby('departmentId')['salary'].max()

    entries_with_max_salary = employee[employee.apply(lambda x: x['salary'] == max_salary_by_department[x['departmentId']], axis=1)]

    df_merged = pd.merge(entries_with_max_salary, department, left_on='departmentId', right_on='id', how='left')

    columns_order = ['name_y', 'name_x', 'salary']

    return df_merged[columns_order].rename(columns={'name_y': 'Department', 'name_x': 'Employee', 'salary': 'Salary'})

In this question, I used a lambda function, which is an anonymous function we can pass inline without defining a named function. First, by using df.groupby() and max(), I gathered the highest salary of each department. Then, by using a lambda function with df.apply(), I kept only the employees whose salary equals the highest salary of their department. Finally, I merged these rows with the department table to get the department names and returned the answer from the result of this merge.
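As an alternative sketch (not the solution above), the row-by-row lambda can be replaced with `groupby(...).transform("max")`, which returns the department maximum aligned with every row so it can be compared with the salary column directly. The sample rows and the `_emp` / `_dept` suffixes are my own choices for illustration.

```python
import pandas as pd

# Invented sample tables for illustration.
employee = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Joe", "Jim", "Henry", "Max"],
    "salary": [70000, 90000, 80000, 90000],
    "departmentId": [1, 1, 2, 1],
})
department = pd.DataFrame({"id": [1, 2], "name": ["IT", "Sales"]})

# One value per row: the highest salary within that row's department.
dept_max = employee.groupby("departmentId")["salary"].transform("max")
top = employee[employee["salary"] == dept_max]

# Explicit suffixes replace the default name_x / name_y column names.
merged = top.merge(department, left_on="departmentId", right_on="id",
                   how="left", suffixes=("_emp", "_dept"))
result = merged[["name_dept", "name_emp", "salary"]].rename(
    columns={"name_dept": "Department", "name_emp": "Employee", "salary": "Salary"}
)
print(result)
```

Ties are kept, so both Jim and Max appear for the IT department in this sample.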

Rank Scores

Q: Write a solution to find the rank of the scores. The ranking should be calculated according to the following rules:

The scores should be ranked from the highest to the lowest.
If there is a tie between two scores, both should have the same ranking.
After a tie, the next ranking number should be the next consecutive integer value. In other words, there should be no holes between ranks.
Return the result table ordered by score in descending order.

def order_scores(scores: pd.DataFrame) -> pd.DataFrame:
    scores['rank'] = scores['score'].rank(method = 'dense', ascending = False)
    return scores.sort_values(by = 'score', ascending = False)[['score', 'rank']]

In this question, I used the rank() function to assign ranks to the scores. I used ‘dense’ parameter in the rank() function. ‘Dense’ is like ‘min’, but rank always increases by 1 between groups.

By setting the method parameter to dense and sorting the DataFrame by score in descending order, I achieved the desired ranking.
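The difference between 'dense' and 'min' is easiest to see side by side. In this sketch with invented scores, both methods give tied scores the same rank, but only 'dense' continues with the next consecutive number after a tie.

```python
import pandas as pd

# Invented scores with two ties.
scores = pd.Series([3.50, 3.65, 4.00, 3.85, 4.00, 3.65])

dense = scores.rank(method="dense", ascending=False)
minimum = scores.rank(method="min", ascending=False)

print(pd.DataFrame({"score": scores, "dense": dense, "min": minimum}))
```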

Delete Duplicate Emails

Q: Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id.

For Pandas users, please note that you are supposed to modify Person in place.

After running your script, the answer shown is the Person table. The driver will first compile and run your piece of code and then show the Person table. The final order of the Person table does not matter.

def delete_duplicate_emails(person: pd.DataFrame) -> None:
    person.sort_values(by=['id'], inplace=True)
    person.drop_duplicates(subset=['email'], inplace=True)

In this question, I sorted the DataFrame by ID to ensure that for duplicate emails, the one with the smallest ID remains. Then, I dropped duplicate emails with df.drop_duplicates(). By default, it keeps only the first occurrence, which is the one with the smallest ID after sorting. The inplace=True parameter makes it modify the DataFrame in place.
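A small sketch with invented rows shows why the sort has to come first: the duplicate with the larger id appears first in the table, so without sorting it would be the one that survives.

```python
import pandas as pd

# Invented rows; the duplicate with the larger id comes first.
person = pd.DataFrame({
    "id": [3, 1, 2],
    "email": ["john@example.com", "john@example.com", "bob@example.com"],
})

person.sort_values(by="id", inplace=True)
# keep="first" is the default; it is spelled out here for clarity.
person.drop_duplicates(subset="email", keep="first", inplace=True)
print(person)
```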

Rearrange Products Table

Q: Write a solution to rearrange the Products table so that each row has (product_id, store, price). If a product is not available in a store, do not include a row with that product_id and store combination in the result table. Return the result table in any order.

def rearrange_products_table(products: pd.DataFrame) -> pd.DataFrame:
    return pd.melt(products, id_vars=['product_id'], value_vars=['store1', 'store2', 'store3'], var_name='store', value_name='price').dropna()

This question just requires a simple “melting” process which I mentioned in my previous article.

I used the pd.melt() function to reshape the table. This function stacks the store1, store2, and store3 columns into a single 'store' column while keeping product_id as an identifier. After reshaping, I dropped any rows with missing prices using df.dropna(), ensuring that only products available in stores are included in the result.
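The intermediate result of the melt is easier to see on a small example. This sketch uses made-up data with only two store columns:

```python
import pandas as pd

# Hypothetical table: product 1 is not sold in store2
products = pd.DataFrame({
    "product_id": [0, 1],
    "store1": [95, 70],
    "store2": [100, None],
})

# One row per (product_id, store) pair, including the missing one
long = pd.melt(products, id_vars=["product_id"],
               value_vars=["store1", "store2"],
               var_name="store", value_name="price")

# Drop the pairs where the product is not available
available = long.dropna()
print(available)
```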

Count Salary Categories

Q: Write a solution to calculate the number of bank accounts for each salary category. The salary categories are:

"Low Salary": All the salaries strictly less than $20000.
"Average Salary": All the salaries in the inclusive range [$20000, $50000].
"High Salary": All the salaries strictly greater than $50000. The result table must contain all three categories. If there are no accounts in a category, return 0. Return the result table in any order.

def count_salary_categories(accounts: pd.DataFrame) -> pd.DataFrame:
    accounts['category'] = 'Low Salary'
    accounts.loc[(accounts['income'] >= 20000) & (accounts['income'] <= 50000), 'category'] = 'Average Salary'
    accounts.loc[accounts['income'] > 50000, 'category'] = 'High Salary'
    accounts = accounts.groupby(by=['category']).size().reset_index(name='accounts_count')
    res = pd.DataFrame({'category':['Low Salary', 'Average Salary', 'High Salary']})
    res = res.merge(accounts, how='left', on='category')
    res.loc[res['accounts_count'].isnull(), 'accounts_count'] = 0
    return res

In this question, I separated the entries into different salary categories. First, I created a column in the DataFrame called category and set it to 'Low Salary' for all entries. Then, by using conditions, I changed the category values according to the income value. Finally, I grouped the entries by category, counted them, and reset the index with the counts stored under the name accounts_count.

To return, I created a DataFrame in the desired format and merged it with the accounts DataFrame. After setting the count of empty categories to 0, I returned the DataFrame.

If you are looking for a one line solution:

def count_salary_categories(accounts: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({'category':['Low Salary','Average Salary','High Salary'], 'accounts_count':[len(accounts[accounts['income'] < 20000]),
     len(accounts[(accounts['income'] >= 20000) & (accounts['income'] <= 50000)]), len(accounts[accounts['income'] > 50000])]})
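As a third option, the "all three categories must appear" requirement can also be handled with value_counts() followed by reindex(). This is only a sketch on made-up incomes, not the solution I submitted:

```python
import pandas as pd

# Hypothetical accounts: nobody falls into "Average Salary"
accounts = pd.DataFrame({
    "account_id": [3, 2, 8, 6],
    "income": [108939, 12747, 87709, 91796],
})

# Label every account, starting from "Low Salary"
category = pd.Series("Low Salary", index=accounts.index)
category[accounts["income"].between(20000, 50000)] = "Average Salary"  # inclusive on both ends
category[accounts["income"] > 50000] = "High Salary"

# reindex adds the missing categories with a count of 0
order = ["Low Salary", "Average Salary", "High Salary"]
counts = category.value_counts().reindex(order, fill_value=0)
result = counts.rename_axis("category").reset_index(name="accounts_count")
print(result)
```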

Find Total Time Spent by Each Employee

Q: Write a solution to calculate the total time in minutes spent by each employee on each day at the office. Note that within one day, an employee can enter and leave more than once. The time spent in the office for a single entry is out_time - in_time.

Return the result table in any order.

def total_time(employees: pd.DataFrame) -> pd.DataFrame:

    employees['total_time'] = employees['out_time'] - employees['in_time']

    employees = employees.groupby(['event_day', 'emp_id'])['total_time'].sum().reset_index(name = 'total_time')
    return employees.rename(columns = {'event_day' : 'day'})

First, I calculated the time spent by each employee for each entry by subtracting the in_time from the out_time and saved these values under total_time. Then, I grouped the DataFrame by event_day and emp_id and got the sum of the total time spent by each employee on each day using df.groupby() and df.sum(). Finally, I reset the index and renamed the columns to meet the desired format.

Game Play Analysis I

Q: Write a solution to find the first login date for each player. Return the result table in any order.

def game_analysis(activity: pd.DataFrame) -> pd.DataFrame:

    activity = activity.groupby('player_id')['event_date'].min().reset_index()

    return activity.rename(columns={'event_date':'first_login'})

This question was very simple. I grouped the entries by player_id and took the earliest event_date of each group with min(), which works on dates just as it does on integers. With a small renaming, I returned the desired DataFrame.

Number of Unique Subjects Taught by Each Teacher

Q: Write a solution to calculate the number of unique subjects each teacher teaches in the university. Return the result table in any order.

def count_unique_subjects(teacher: pd.DataFrame) -> pd.DataFrame:
    teacher.drop_duplicates(subset=["teacher_id", "subject_id"], inplace=True)
    teacher = teacher.groupby(by=["teacher_id"])[['subject_id']].count().reset_index()
    return teacher.rename(columns={"subject_id":"cnt"})

First of all, I dropped the duplicates according to the teacher_id and subject_id values. This removes the effect of dept_id, because a subject taught by the same teacher in several departments is then counted only once. With the duplicates gone, I grouped the entries by teacher and counted the subjects each one teaches. After this, I returned the desired DataFrame with a small renaming.

Classes More Than 5 Students

Q: Write a solution to find all the classes that have at least five students. Return the result table in any order.

def find_classes(courses: pd.DataFrame) -> pd.DataFrame:
    courses = courses.groupby(by=["class"], as_index=False)[["student"]].count()
    courses = courses[courses["student"] >= 5]
    return courses.drop(columns=["student"])

First of all, I grouped the entries by class to get the number of students in each one. Then, I selected the entries where the number of students in the class is equal to or more than 5. After this step, I simply dropped the student column and was left with the classes that have at least 5 students.

Customer Placing the Largest Number of Orders

Q: Write a solution to find the customer_number for the customer who has placed the largest number of orders.

The test cases are generated so that exactly one customer will have placed more orders than any other customer.

def largest_orders(orders: pd.DataFrame) -> pd.DataFrame:
    result = orders.groupby(by=["customer_number"], as_index=False)[["order_number"]].count()
    result = result.sort_values(by=["order_number"],ascending=False).reset_index(drop=True)

    return result.drop(columns=["order_number"]).head(1)

First, I grouped the entries by customer_number and counted the number of orders for each customer using df.groupby() and df.count().

Then, I sorted the result in descending order based on the count of orders to identify the customer with the largest number of orders. I used the ascending=False parameter to sort in descending order. After this, I reset the index, dropped the order_number column since it is not asked for, and got the first entry of the DataFrame by using the df.head(1) function.

At the end of the question, there is a follow-up: What if more than one customer has the largest number of orders, can you find all the customer_number in this case?

In such a case, I would follow the same procedure and get the first entry with df.head(1) to find the largest number of orders. Then I would use that value to return the customer numbers with that number of orders.
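A minimal sketch of that follow-up idea, on made-up orders where two customers are tied. It compares against the maximum count directly instead of using df.head(1), which is a small simplification of the procedure described above:

```python
import pandas as pd

# Hypothetical orders: customers 2 and 3 both placed two orders
orders = pd.DataFrame({
    "order_number": [1, 2, 3, 4, 5],
    "customer_number": [1, 2, 2, 3, 3],
})

# Number of orders per customer
counts = orders.groupby("customer_number")["order_number"].count()

# Keep every customer whose count equals the largest count
top_customers = counts[counts == counts.max()].index.tolist()
print(top_customers)
```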

Group Sold Products By The Date

Q: Write a solution to find for each date the number of different products sold and their names. The sold products names for each date should be sorted lexicographically. Return the result table ordered by sell_date.

def categorize_products(activities: pd.DataFrame) -> pd.DataFrame:
    activities = activities.groupby(['sell_date'],as_index=False)
    activities = activities.agg({'product':[lambda x: x.nunique(), lambda x: ','.join(sorted(x.unique()))]})
    activities.columns = ['sell_date','num_sold','products']
    return activities.sort_values('sell_date')

In this question, I used 2 lambda functions to apply specific functions to each group. First, I grouped the activities DataFrame by sell_date using df.groupby() and then used df.agg() to apply multiple aggregation functions. In the aggregation, I calculated the number of unique products sold with lambda x: x.nunique() and concatenated the names of the unique products, sorted lexicographically, with lambda x: ','.join(sorted(x.unique())). nunique() counts the number of distinct values and unique() returns the distinct values themselves.

After a renaming and sorting by sell_date value, I returned the desired DataFrame.
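The same result can be sketched with named aggregation, which sets the output column names inside agg() so no manual renaming is needed. The data below is made up for illustration:

```python
import pandas as pd

# Hypothetical sales: "Headphone" is sold twice on the same day
activities = pd.DataFrame({
    "sell_date": ["2020-05-30", "2020-06-01", "2020-05-30", "2020-05-30"],
    "product": ["Headphone", "Pencil", "Basketball", "Headphone"],
})

result = (
    activities.groupby("sell_date")["product"]
    .agg(
        num_sold="nunique",  # number of distinct products per date
        products=lambda names: ",".join(sorted(names.unique())),
    )
    .reset_index()
)
print(result)
```

groupby() sorts the group keys by default, so the result is already ordered by sell_date here.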

Daily Leads and Partners

Q: For each date_id and make_name, find the number of distinct lead_id's and distinct partner_id's. Return the result table in any order.

def daily_leads_and_partners(daily_sales: pd.DataFrame) -> pd.DataFrame:
    daily_sales = daily_sales.groupby(by=["date_id", "make_name"]).nunique().reset_index()
    return daily_sales.rename(columns={"lead_id":"unique_leads", "partner_id":"unique_partners"})

First, I grouped the daily_sales DataFrame by date_id and make_name using df.groupby() to aggregate the counts of distinct values. In the aggregation, I used df.nunique() to calculate the number of distinct lead_id's and partner_id's for each group. Finally, I reset the index and renamed the columns to unique_leads and unique_partners to reflect the counts correctly.

Actors and Directors Who Cooperated At Least Three Times

Q: Write a solution to find all the pairs (actor_id, director_id) where the actor has cooperated with the director at least three times.

Return the result table in any order.

def actors_and_directors(actor_director: pd.DataFrame) -> pd.DataFrame:
    actor_director = actor_director.groupby(by=["actor_id","director_id"]).count().reset_index()
    actor_director = actor_director[actor_director["timestamp"] >= 3]

    return actor_director[["actor_id", "director_id"] ]

In this question, I used df.groupby() on 2 columns at the same time, since the count is asked for each (actor_id, director_id) pair. After grouping, I filtered out the entries that have a count smaller than 3 and returned the DataFrame.

Replace Employee ID With The Unique Identifier

Q: Write a solution to show the unique ID of each user. If a user does not have a unique ID, just show null. Return the result table in any order.

def replace_employee_id(employees: pd.DataFrame, employee_uni: pd.DataFrame) -> pd.DataFrame:
    result = pd.merge(employees, employee_uni, how="outer")
    return result[["unique_id","name"]].dropna(subset=["name"])

First of all, I merged the employees and employee_uni DataFrames. Notice that I used the how='outer' parameter, which takes the union of the DataFrames. In this way, I am able to place a null value for employees that don't have a unique ID. Finally, I returned the resulting DataFrame with the name and unique ID. Notice that I used the df.dropna() function because of a test case in which employee_uni contains IDs with no matching employee, which leaves the name value of those entries null after the merge.
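As an alternative sketch (not the solution above), a left join keeps exactly the rows of employees, so rows with a null name never appear and no dropna() is needed. The data is made up, and it assumes id is the shared key column:

```python
import pandas as pd

# Hypothetical data: Alice has no unique ID, and id 99 matches no employee
employees = pd.DataFrame({"id": [1, 7, 3], "name": ["Alice", "Bob", "Winston"]})
employee_uni = pd.DataFrame({"id": [3, 7, 99], "unique_id": [10, 20, 30]})

# Left join: every employee is kept, unmatched ones get NaN as unique_id
result = employees.merge(employee_uni, how="left", on="id")[["unique_id", "name"]]
print(result)
```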

Students and Examinations

Q: Write a solution to find the number of times each student attended each exam. Return the result table ordered by student_id and subject_name.

def students_and_examinations(students: pd.DataFrame, subjects: pd.DataFrame, examinations: pd.DataFrame) -> pd.DataFrame:

    examinations = examinations.groupby(['student_id', 'subject_name']).agg(attended_exams=('subject_name', 'count')).reset_index()

    df = pd.merge(students, subjects, how = 'cross').sort_values(by = ['student_id' , 'subject_name'])

    df = pd.merge( df, examinations, how = 'left', on = ['student_id', 'subject_name'])

    df['attended_exams'] = df['attended_exams'].fillna(0)
    return df[['student_id', 'student_name', 'subject_name', 'attended_exams']]

First of all, I grouped and aggregated the entries in the examinations DataFrame to create a column with the number of exams that each student took in each subject. Then, I merged students and subjects with the how='cross' parameter to take the cross product of these DataFrames. Finally, I merged these 2 modified DataFrames with the how='left' parameter to conduct a left outer join. I replaced the null values with 0 and returned the resulting DataFrame.
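The cross product step is the least familiar one, so here is a small sketch of it on made-up data (how='cross' needs pandas 1.2 or newer):

```python
import pandas as pd

# Hypothetical inputs
students = pd.DataFrame({"student_id": [1, 2], "student_name": ["Alice", "Bob"]})
subjects = pd.DataFrame({"subject_name": ["Math", "Physics"]})
examinations = pd.DataFrame({
    "student_id": [1, 1, 2],
    "subject_name": ["Math", "Math", "Physics"],
})

# Every student paired with every subject: 2 x 2 = 4 rows
grid = students.merge(subjects, how="cross")

# Exams actually taken, counted per (student, subject) pair
counts = (examinations.groupby(["student_id", "subject_name"])
          .size().reset_index(name="attended_exams"))

# Pairs without any exam get NaN from the left join, then 0
df = grid.merge(counts, how="left", on=["student_id", "subject_name"])
df["attended_exams"] = df["attended_exams"].fillna(0).astype(int)
print(df)
```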

Managers with at Least 5 Direct Reports

Q: Write a solution to find managers with at least five direct reports. Return the result table in any order.

def find_managers(employee: pd.DataFrame) -> pd.DataFrame:
    df = employee.groupby(['managerId'])['id'].count().reset_index()
    df = df.loc[df['id']>=5,['managerId']]
    df = employee.loc[employee['id'].isin(df['managerId']),['name']]

    return df

First, I grouped the entries of the employee DataFrame by managerId and counted the id values to find the number of direct reports of each manager. I filtered out the entries with counts less than 5, which leaves the IDs of the managers with at least 5 direct reports. Then I compared each id in the employee DataFrame with these manager IDs using isin() and returned the names of the matching employees.

Sales Person

Q: Write a solution to find the names of all the salespersons who did not have any orders related to the company with the name “RED”. Return the result table in any order.

def sales_person(sales_person: pd.DataFrame, company: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    df = company.merge(orders, how='left')
    df = df.drop(df[df.name != 'RED'].index)

    df = sales_person.loc[sales_person['sales_id'].isin(df['sales_id']) == False,['name']]
    return df

First of all, I merged the company and orders DataFrames and from this merged DataFrame, I dropped the entries that aren’t related to the company with the name ‘RED’. Then, I compared the sales person IDs with the entries that are related to company RED. If the sales person is not matched, then it will be saved in the result DataFrame.

Well, that was the last problem in the plan. I hope you enjoyed reading and it was useful for you. I will share more stories when I solve other study plans in LeetCode. Thanks for reading!

LeetCode Study Plan: Introduction to Pandas

Ata Seren — Fri, 15 Mar 2024 21:00:13 +0000

What is Study Plan?

Introduction to Pandas Study Plan

This study plan on LeetCode aims to teach the basics of Pandas through 15 simple questions. I have used Pandas a few times, mostly for machine learning projects, but usually I just used the DataFrame object and not many of its functions. Therefore, I wanted to start with an easy plan about Pandas and learn its basics.

During my solving process, I took some notes to reinforce my understanding of Pandas functions and their capabilities. This story is written according to these notes. In this story, I’ll name the questions, provide my answers, and explain the functions used in the solutions.

Note: I won’t delve into the details of the problems, as they are available in a better format and with examples on LeetCode.

Create a DataFrame from List

Q: Write a solution to create a DataFrame from a 2D list called student_data. This 2D list contains the IDs and ages of some students. The DataFrame should have two columns, student_id and age, and be in the same order as the original 2D list.

df = pd.DataFrame(student_data)
df.columns = ["student_id", "age"]
return df

This creates a DataFrame and then sets its column names. Assigning a list to df.columns names the columns of a DataFrame.

Get the Size of a DataFrame

Q: Write a solution to calculate and display the number of rows and columns of players. Return the result as an array:
[number of rows, number of columns]

return [players.shape[0], players.shape[1]]

df.shape returns a tuple with the number of rows and columns of the DataFrame: (row_count, column_count)

Display the First Three Rows

Q: Write a solution to display the first 3 rows of this DataFrame.

return employees.head(3)

df.head(n) returns the first n rows of df.

Select Data

Q: Write a solution to select the name and age of the student with student_id = 101.

return students.loc[students['student_id'] == 101, ['name', 'age']]

In df.loc, the first parameter is the condition used for the search, and the second parameter is a list with desired columns.

Here is an example with multiple conditions:

students.loc[(students['student_id'] == 101) & (students['name'] == "Ulysses"), ['name', 'age']]

Create a New Column

Q: A company plans to provide its employees with a bonus. Write a solution to create a new column name bonus that contains the doubled values of the salary column.

bonus = []
for s in employees["salary"]:
    bonus.append(s*2)

result = employees.assign(bonus=bonus)
return result

In df.assign, each keyword argument is a column name, and its value is the list of values to be used in that column:

df.assign(column_name=[element1, element2, element3])

Note: In the problem “Modify Columns”, I give some examples of modifying the values in a column of a DataFrame. They are similar to what I did in this problem, but they work on the column directly instead of building the values outside of the DataFrame.
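For comparison, the same bonus column can be sketched without the explicit loop, since arithmetic on a column applies to every value at once. The salaries below are made up:

```python
import pandas as pd

# Hypothetical employees
employees = pd.DataFrame({"name": ["Piper", "Grace"], "salary": [1000, 2000]})

# Multiplying the column doubles every salary; assign returns a new DataFrame
result = employees.assign(bonus=employees["salary"] * 2)
print(result)
```

df.assign does not change employees itself, it returns a copy with the extra column.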

Drop Duplicate Rows

Q: There are some duplicate rows in the DataFrame based on the email column. Write a solution to remove these duplicate rows and keep only the first occurrence.

df = customers.drop_duplicates(subset=['email'])
return df

df.drop_duplicates simply drops duplicates according to the values given in a column or columns. You can give multiple values as below:

dedup_df = df.drop_duplicates(subset=['A', 'B'])

Drop Missing Data

Q: There are some rows having missing values in the name column. Write a solution to remove the rows with missing values.

return students.dropna(subset=['name'])

df.dropna drops the rows with missing values. In this question, a column name is given in subset, so a row is dropped only if its value in that column is missing. df.dropna can take various parameters to handle missing values in different ways.
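A short sketch of a few of those parameters on made-up data:

```python
import pandas as pd

# Hypothetical students: row 1 has no name, row 2 has no age
students = pd.DataFrame({
    "name": ["Piper", None, "Grace"],
    "age": [20, 21, None],
})

any_missing = students.dropna()                  # drop rows with any missing value
name_missing = students.dropna(subset=["name"])  # only look at the name column
all_missing = students.dropna(how="all")         # drop rows where every value is missing

print(any_missing.index.tolist(), name_missing.index.tolist(), all_missing.index.tolist())
```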

Modify Columns

Q: A company intends to give its employees a pay rise. Write a solution to modify the salary column by multiplying each salary by 2.

employees.salary = employees.salary*2
return employees

In this question, it is asked to double the values of a column, so I directly accessed the column with df.column_name and doubled its values.

Here is an additional example:

import numpy as np

# Step 1: Select the column
age_column = df['age']

# Step 2: Apply a function to each value
def sqrt(x):
    return np.sqrt(x)
new_age_column = age_column.apply(sqrt)

# Step 3: Assign the new values back to the column
df['age'] = new_age_column

apply is called here on a single column (a Series), so it applies the function to each value in that column.

Rename Columns

Q: Write a solution to rename the columns as follows:

id to student_id
first to first_name
last to last_name
age to age_in_years

return students.rename(columns = {'id':'student_id', 'first':'first_name', 'last':'last_name', 'age':'age_in_years'})

df.rename can be used to change the names of the index or columns like in this case. With inplace=True parameter, df can be modified instead of creating a new one.

Change Data Type

Q: Write a solution to correct the errors: The grade column is stored as floats, convert it to integers.

return students.astype({'grade': int})

df.astype is used to change the data type of an object in the DataFrame. It can be used for specific columns or for all of them. To solve this question, different approaches can be used, such as applying a conversion function to all elements in a column with apply, or using pd.to_numeric to convert non-numeric objects into numeric ones if possible.
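Here is a small sketch of both conversions on made-up values. The errors="coerce" argument is my choice for the example; it turns values that cannot be parsed into NaN instead of raising an error:

```python
import pandas as pd

# Hypothetical grades stored as floats
students = pd.DataFrame({"grade": [73.0, 87.0]})
as_int = students.astype({"grade": int})

# Hypothetical strings, one of which is not a number
raw = pd.Series(["1", "2", "x"])
numeric = pd.to_numeric(raw, errors="coerce")

print(as_int["grade"].dtype, numeric.tolist())
```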

Fill Missing Data

Q: Write a solution to fill in the missing value as 0 in the quantity column.

products['quantity'] = products['quantity'].fillna(0)
return products

In this case, it is asked to fill missing data in a single column. That’s why I operated on the quantity column. df.fillna(x) can be used to replace all missing values with the given parameter x.

To achieve the same result, df.replace could be used too:

df['DataFrame Column'] = df['DataFrame Column'].replace(np.nan, 0)

Reshape Data: Concatenate

Q: Write a solution to concatenate these two DataFrames vertically into one DataFrame.

return pd.concat([df1, df2], axis=0)

pd.concat can be used to concatenate 2 DataFrames horizontally (same rows, new columns) or vertically (same columns, new rows). axis = 0 is for vertical, and axis = 1 is for horizontal concatenation. Other than this, pd.merge and df.join can be used to combine DataFrames. Older versions of Pandas also had df.append for vertical concatenation, but it was removed in Pandas 2.0 in favor of pd.concat.

# Concatenation with pd.merge
result = pd.merge(df, df1, on='Courses', how='outer', suffixes=('_df1', '_df2')).fillna(0)

result['Fee'] = result['Fee_df1'] + result['Fee_df2']
result = result[['Courses', 'Fee']]

# Concatenation with df.join
result = df.join(df1)

# Vertical concatenation, formerly df.append(df1, ignore_index=True);
# df.append was removed in Pandas 2.0, so pd.concat is used instead
result = pd.concat([df, df1], ignore_index=True)

Reshape Data: Pivot

Q: Write a solution to pivot the data so that each row represents temperatures for a specific month, and each city is a separate column.

return weather.pivot(index='month', columns='city', values='temperature')

df.pivot reshapes a DataFrame from long to wide format. Here it works on 3 columns: one supplies the new index, one supplies the new column names, and one supplies the values that fill the table. The result is a smaller DataFrame that carries the same information in a form that is easier to read.
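A minimal sketch of the reshaping on made-up weather data:

```python
import pandas as pd

# Hypothetical long-format data: one row per (city, month) pair
weather = pd.DataFrame({
    "city": ["Ankara", "Izmir", "Ankara", "Izmir"],
    "month": ["January", "January", "February", "February"],
    "temperature": [1, 9, 3, 10],
})

# Months become the index, cities become the columns
wide = weather.pivot(index="month", columns="city", values="temperature")
print(wide)
```

Each (month, city) pair must appear at most once, otherwise df.pivot raises an error.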

Reshape Data: Melt

Q: Write a solution to reshape the data so that each row represents sales data for a product in a specific quarter.

return pd.melt(report, id_vars=['product'], value_vars=['quarter_1', 'quarter_2', 'quarter_3', 'quarter_4'], var_name='quarter', value_name='sales')

pd.melt reshapes a DataFrame to be more computer-friendly. In this problem, pd.melt is used to merge values of multiple columns into a single column. The names of the columns are also used as variable names for the values.

Method Chaining

Q: Write a solution to list the names of animals that weigh strictly more than 100 kilograms. Return the animals sorted by weight in descending order.

return animals[animals['weight'] > 100].sort_values(['weight'], ascending=False)[['name']]

Method chaining is an approach to data manipulation that allows multiple operations to be executed in a single expression. Each operation returns a new object, and the next operation is called on that result using dot notation.
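Longer chains are often easier to read when wrapped in parentheses with one step per line. This is a sketch of the same solution in that style, on made-up animals:

```python
import pandas as pd

# Hypothetical animals
animals = pd.DataFrame({
    "name": ["Tatiana", "Khaled", "Alex", "Jonathan"],
    "weight": [464, 41, 328, 463],
})

result = (
    animals
    .loc[lambda d: d["weight"] > 100]        # keep animals heavier than 100 kg
    .sort_values("weight", ascending=False)  # heaviest first
    [["name"]]                               # keep only the name column
)
print(result)
```

The lambda inside .loc receives the DataFrame at that point of the chain, so the filter does not have to refer back to the animals variable.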

Well, that was the last problem in the plan. I hope you enjoyed reading and it was useful for you. I will share more stories when I solve other study plans in LeetCode. Thanks for reading!