Mhamad El Itawi

Bulk Insert/Update task

One of my duties as a team lead and senior software engineer is reviewing my team’s code.

The following was a task given to a team member:

  • The user needs to upload user data via an Excel file (one user per row)
  • If the username already exists, update the user
  • If not, create a new user

Proposed code (The wrong way)

The pseudocode written by the junior developer was as follows (some parts of the code, like data validation and error handling, are removed for simplicity):

ParseAndUpdate (excelFile){
    List users = parseExcel(excelFile);
    for (user in users){
        if (checkIfUserExist(user))
            update(user)
        else
            create(user)
        save(user)
    }
}

The developer is using an ORM library to connect to the database, so to him it looked like a straightforward task: call some “repository functions” and it’s done, or as we joke internally, it “works like a charm”.

Code review
Even if the code looks simple and works, it’s drastically wrong!

There are four issues:

  • Multiple levels of abstraction and a lack of separation of concerns: the current code is both parsing and performing database transactions, which contradicts the “one level of abstraction per function” principle
  • Non-reusable: if I need to introduce a new type of input file, like XML, I will need to rewrite the function to call a different parser
  • Bombarding the database: the proposed solution literally makes three hits per row (check, update/create, and save), as the sketch after this list illustrates. If we look into the core of the ORM, we will also see a lot of session openings, ID creation, auto-incrementing of IDs, commits, etc. If we have 3,000 rows, we can expect 9,000 hits on the database.
  • Time-consuming process: at scale, the process will take too much time due to the network latency/cost/overhead of each database hit
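
To make the cost concrete, here is a rough Java sketch of the per-row approach. UserRepository and its methods are hypothetical, mirroring the pseudocode above:

// Illustration only: counting round trips in the per-row approach.
// UserRepository and its methods are hypothetical names.
void parseAndUpdate(List<User> users, UserRepository userRepository) {
    for (User user : users) {
        if (userRepository.existsByUsername(user.getUsername())) { // hit 1: SELECT
            userRepository.update(user);                           // hit 2: UPDATE
        } else {
            userRepository.create(user);                           // hit 2: INSERT
        }
        userRepository.save(user);                                 // hit 3: an extra write
    }
    // 3,000 rows x 3 statements = ~9,000 round trips, before counting
    // ORM overhead (sessions, ID generation, auto-increments, commits).
}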

On a side note, even if we saved in bulk at the end, the database would face a high risk of database locks, because transactions would spend a lot of time waiting before committing. This introduces a scalability issue.

Proposed Fix

After explaining the issues to the developer, I wrote the following code:

UserList parseExcelIntoListOfUsers (file){
   return ExcelParser.parse(file)
}

List getAllUsernames (usersList){
   List usernames;
   for (user in usersList)
       usernames.add(user.username)
   return usernames
}

Map indexingTheAlreadyExistingUsers (existingUsers){
   Map <Username, index> existingUsersIndex;
   for (index = 0; index < existingUsers.size; index++)
       existingUsersIndex.add(existingUsers.get(index).username, index)
   return existingUsersIndex
}

insertOrUpdateByBulkUsers (usersList){
    List usernames = getAllUsernames(usersList)
    List existingUsers = getFromDatabaseByUsernames(usernames)
    Map <Username, index> existingUsersIndexMap =
        indexingTheAlreadyExistingUsers(existingUsers)

    var index;

    for (i = 0; i < usersList.size; i++){
        index = existingUsersIndexMap.getByUsername(usersList.get(i).username)
        if (index == null) // create new
            usersList.get(i).postCreation() // some functions we do to create custom fields
        else // update
            usersList.get(i).update(existingUsers.get(index))
    }
    saveBulk(usersList)
}

process (file){
    usersList = parseExcelIntoListOfUsers(file);
    insertOrUpdateByBulkUsers(usersList)
}

First of all, the parsing process is extracted from the rest of the solution. Now it’s possible to introduce a new type of parser. (Kindly note that in a real codebase the parser would be written behind an abstraction layer, but here I preferred to keep it simple.)
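
As an illustration, that abstraction layer could be a minimal Java sketch like this one. The interface and class names are my own, and the User type is assumed to exist in the project:

import java.io.File;
import java.util.List;

// Each input format gets its own parser implementation, so the rest of
// the pipeline stays untouched when a new format is introduced.
interface UserParser {
    List<User> parse(File file);
}

class ExcelUserParser implements UserParser {
    @Override
    public List<User> parse(File file) {
        // Delegate to whichever Excel library the project already uses
        // (e.g. Apache POI); omitted here to keep the sketch short.
        throw new UnsupportedOperationException("plug in Excel parsing here");
    }
}

class XmlUserParser implements UserParser {
    @Override
    public List<User> parse(File file) {
        throw new UnsupportedOperationException("plug in XML parsing here");
    }
}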

After getting the list of users, the process will be like this:

  1. We get the list of all usernames
  2. We search for them in the database using 1 hit only
  3. We keep and index the list of existing users in memory
  4. Now we iterate over the list
  5. We do the verification in memory instead of against the database
  6. We apply the changes to the list of objects
  7. We save the list in bulk with 1 hit

Many will ask why I added an indexing/mapping process. The idea is to reduce the complexity of the code from O(n²) to O(n) using the “memoization” concept: instead of searching the list for each username, I benefit from the map’s O(1) lookup complexity.
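
Put together, steps 1 through 7 could look like the following Java sketch. It assumes a Spring Data style repository with a derived findByUsernameIn query and saveAll; postCreation and update are the hypothetical helpers from the pseudocode:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class BulkUserImporter {

    private final UserRepository userRepository; // assumed Spring Data style repository

    BulkUserImporter(UserRepository userRepository) {
        this.userRepository = userRepository;
    }

    void insertOrUpdateByBulk(List<User> incoming) {
        // Step 1: collect all usernames from the upload
        List<String> usernames = incoming.stream()
                .map(User::getUsername)
                .collect(Collectors.toList());

        // Step 2: fetch every existing user with a single SELECT (hit #1)
        List<User> existing = userRepository.findByUsernameIn(usernames);

        // Step 3: index existing users by username for O(1) lookups
        Map<String, User> existingByUsername = new HashMap<>();
        for (User user : existing) {
            existingByUsername.put(user.getUsername(), user);
        }

        // Steps 4-6: decide create vs. update entirely in memory
        for (User user : incoming) {
            User match = existingByUsername.get(user.getUsername());
            if (match == null) {
                user.postCreation();  // hypothetical hook for custom fields
            } else {
                user.update(match);   // hypothetical merge with the existing row
            }
        }

        // Step 7: one bulk write (hit #2)
        userRepository.saveAll(incoming);
    }
}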

Limitations and possible enhancements

Nonetheless, our new solution isn’t perfect. Technically, the enhancements in performance, speed, and scalability come at the expense of two main elements:

  • Memory usage: more memory is used to hold the lists and the map
  • More resources are consumed temporarily on the database engine per hit

To keep the trade-off acceptable, it’s good to introduce a limit (LIMIT) on the list size, for example 5,000. If this is not acceptable from a business perspective, the solution is to chunk the batch on the backend and process it in bulks of LIMIT (for example 5,000), as sketched below. Even a queuing system can be introduced.
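
Chunking could look like the following sketch, reusing the BulkUserImporter from above; LIMIT is the example batch size from the text:

import java.util.List;

class ChunkedImporter {

    private static final int LIMIT = 5_000; // example batch size

    private final BulkUserImporter importer;

    ChunkedImporter(BulkUserImporter importer) {
        this.importer = importer;
    }

    void process(List<User> allUsers) {
        // Process the upload in chunks of LIMIT rows so that memory usage
        // and database pressure stay bounded; each chunk is one bulk upsert.
        for (int from = 0; from < allUsers.size(); from += LIMIT) {
            int to = Math.min(from + LIMIT, allUsers.size());
            importer.insertOrUpdateByBulk(allUsers.subList(from, to));
        }
    }
}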

(Diagram: possible enhancements architecture)
