Mhamad El Itawi

Bulk Insert/Update task

One of my duties as a team lead and senior software engineer is reviewing my team’s code.

The following was a task given to a team member:

  • The user needs to upload user data via an Excel file (one user per row)
  • If the username already exists, update the user
  • If not, create a new user

Proposed code (The wrong way)

The pseudocode written by the junior developer was as follows (some parts of the code, like data validation and error handling, are removed for simplicity):

ParseAndUpdate (excelFile){
    List users = parseExcel(excelFile);
    for (user in users){
        if (checkIfUserExist(user))
            update(user)
        else
            create(user)
        save(user)
    }
}

The developer is using an ORM library to connect to the database, so to him it looked like a straightforward task: call some “repository functions” and it’s done, or as we joke internally, it “works like a charm”.

Code review
Even if the code looks simple and works, it’s drastically wrong!

There are four issues:

  • Multiple levels of abstraction and a lack of separation of concerns: the current code is both parsing and performing database transactions, which contradicts the “one level of abstraction per function” principle
  • Non-reusable: if I need to introduce a new type of input file, like XML, I will need to rewrite the function to call a different parser
  • Bombarding the database: the proposed solution literally makes three hits per row (check, update/create, and save), as the sketch after this list illustrates. If we look into the core of the ORM, we will also see a lot of session openings, ID creation, auto-incrementing of IDs, commits, etc. If we have 3,000 rows, we can expect 9,000 hits on the database.
  • Time-consuming process: at scale, the process will take too much time due to the network latency/cost/overhead of each database hit
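
To make the cost concrete, here is a rough Java sketch of the per-row approach. UserRepository and its methods are hypothetical, mirroring the pseudocode above:

// Illustration only: counting round trips in the per-row approach.
// UserRepository and its methods are hypothetical names.
void parseAndUpdate(List<User> users, UserRepository userRepository) {
    for (User user : users) {
        if (userRepository.existsByUsername(user.getUsername())) { // hit 1: SELECT
            userRepository.update(user);                           // hit 2: UPDATE
        } else {
            userRepository.create(user);                           // hit 2: INSERT
        }
        userRepository.save(user);                                 // hit 3: an extra write
    }
    // 3,000 rows x 3 statements = ~9,000 round trips, before counting
    // ORM overhead (sessions, ID generation, auto-increments, commits).
}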

On a side note, even if we saved in bulk at the end, the database would face a high risk of database locks, because transactions would spend a lot of time waiting before committing. This introduces a scalability issue.

Proposed Fix

After explaining the issues to the developer, I wrote the following code:

UserList parseExcelIntoListOfUsers (file){
   return ExcelParser.parse(file)
}

List getAllUsernames (usersList){
   List usernames;
   for (user in usersList)
       usernames.add(user.username)
   return usernames
}

Map indexingTheAlreadyExistingUsers (existingUsers){
   Map <Username, index> existingUsersIndex;
   for (index = 0; index < existingUsers.size; index++)
       existingUsersIndex.add(existingUsers.get(index).username, index)
   return existingUsersIndex
}

insertOrUpdateByBulkUsers (usersList){
    List usernames = getAllUsernames(usersList)
    List existingUsers = getFromDatabaseByUsernames(usernames)
    Map <Username, index> existingUsersIndexMap =
        indexingTheAlreadyExistingUsers(existingUsers)

    var index;

    for (i = 0; i < usersList.size; i++){
        index = existingUsersIndexMap.getByUsername(usersList.get(i).username)
        if (index == null) // create new
            usersList.get(i).postCreation() // some functions we do to create custom fields
        else // update
            usersList.get(i).update(existingUsers.get(index))
    }
    saveBulk(usersList)
}

process (file){
    usersList = parseExcelIntoListOfUsers(file);
    insertOrUpdateByBulkUsers(usersList)
}

First of all, the parsing process is extracted from the rest of the solution. Now it’s possible to introduce a new type of parser. (Kindly note that in a real codebase the parser would be written behind an abstraction layer, but here I preferred to keep it simple.)
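
As an illustration, that abstraction layer could be a minimal Java sketch like this one. The interface and class names are my own, and the User type is assumed to exist in the project:

import java.io.File;
import java.util.List;

// Each input format gets its own parser implementation, so the rest of
// the pipeline stays untouched when a new format is introduced.
interface UserParser {
    List<User> parse(File file);
}

class ExcelUserParser implements UserParser {
    @Override
    public List<User> parse(File file) {
        // Delegate to whichever Excel library the project already uses
        // (e.g. Apache POI); omitted here to keep the sketch short.
        throw new UnsupportedOperationException("plug in Excel parsing here");
    }
}

class XmlUserParser implements UserParser {
    @Override
    public List<User> parse(File file) {
        throw new UnsupportedOperationException("plug in XML parsing here");
    }
}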

After getting the list of users, the process will be like this:

  1. We get the list of all usernames
  2. We search for them in the database using 1 hit only
  3. We keep and index the list of existing users in memory
  4. Now we iterate over the list
  5. We do the verification in memory instead of against the database
  6. We apply the changes to the list of objects
  7. We save the list in bulk with 1 hit

Many will ask why I added an indexing/mapping process. The idea is to reduce the complexity of the code from O(n²) to O(n) using the “memoization” concept: instead of searching the list for each username, I benefit from the map’s O(1) lookup complexity.
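
Put together, steps 1 through 7 could look like the following Java sketch. It assumes a Spring Data style repository with a derived findByUsernameIn query and saveAll; postCreation and update are the hypothetical helpers from the pseudocode:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class BulkUserImporter {

    private final UserRepository userRepository; // assumed Spring Data style repository

    BulkUserImporter(UserRepository userRepository) {
        this.userRepository = userRepository;
    }

    void insertOrUpdateByBulk(List<User> incoming) {
        // Step 1: collect all usernames from the upload
        List<String> usernames = incoming.stream()
                .map(User::getUsername)
                .collect(Collectors.toList());

        // Step 2: fetch every existing user with a single SELECT (hit #1)
        List<User> existing = userRepository.findByUsernameIn(usernames);

        // Step 3: index existing users by username for O(1) lookups
        Map<String, User> existingByUsername = new HashMap<>();
        for (User user : existing) {
            existingByUsername.put(user.getUsername(), user);
        }

        // Steps 4-6: decide create vs. update entirely in memory
        for (User user : incoming) {
            User match = existingByUsername.get(user.getUsername());
            if (match == null) {
                user.postCreation();  // hypothetical hook for custom fields
            } else {
                user.update(match);   // hypothetical merge with the existing row
            }
        }

        // Step 7: one bulk write (hit #2)
        userRepository.saveAll(incoming);
    }
}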

Limitations and possible enhancements

Nonetheless, our new solution isn’t perfect. Technically, the enhancements in performance, speed, and scalability come at the expense of two main elements:

  • Memory usage: more memory is used to hold the lists and the map
  • More resources are consumed temporarily on the database engine per hit

To keep the trade-off acceptable, it’s good to introduce a limit (LIMIT) on the list size, for example 5,000. If this is not acceptable from a business perspective, the solution is to chunk the batch on the backend and process it in bulks of LIMIT (for example 5,000), as sketched below. Even a queuing system can be introduced.
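
Chunking could look like the following sketch, reusing the BulkUserImporter from above; LIMIT is the example batch size from the text:

import java.util.List;

class ChunkedImporter {

    private static final int LIMIT = 5_000; // example batch size

    private final BulkUserImporter importer;

    ChunkedImporter(BulkUserImporter importer) {
        this.importer = importer;
    }

    void process(List<User> allUsers) {
        // Process the upload in chunks of LIMIT rows so that memory usage
        // and database pressure stay bounded; each chunk is one bulk upsert.
        for (int from = 0; from < allUsers.size(); from += LIMIT) {
            int to = Math.min(from + LIMIT, allUsers.size());
            importer.insertOrUpdateByBulk(allUsers.subList(from, to));
        }
    }
}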

(Diagram: possible enhancements architecture)
