Avoiding I/O performance pitfalls
Cover Photo by Önder Örtel on Unsplash
I/O operations are expensive. As an application scales I/O operations can become serious bottlenecks. Whether an application pulls data from a database, a web service, or an IoT device, it can suffer from the same problem: expensive I/O operations inside of loops. Here we will view the problem through the lens of Entity Framework in C#.
Identifying the Problem
Without experience with this issue, it can be easy to overlook during initial development. Unit tests, integration tests, and manual QA testing may not use large enough datasets to identify problems preemptively. If not caught ahead of time, the issue will certainly be noticed in production. Here is what it might look like:
public User[] GetUsers() {
User[] users = GetRawUsers();
foreach (var user in users)
{
// ...
SomeExpensiveOperation(user);
// ...
{
return users;
}
Let’s say that GetRawUsers()
and SomeExpensiveOption()
take 100ms each to run. If there are only 10 users in the users array, then GetUser()
takes 1100ms to run.
100ms + (100ms * user_count)
You can see the problem with scale now. The cost in time increases linearly with the number of users. With only 100 users, it will take 10,100ms to run the GetUsers()
method and even worse if SomeExpensiveOperation()
contains a loop of its own.
How to address it
In real-world scenarios, the problem code might be more subtle than the above contrived scenario. Here are some ways to identify such cases.
Be mindful of when a query is executed
One common cause for this issue when using Entity Framework, and I suppose other frameworks that provide query building, is not paying attention to when the query is executed. For example:
var usersWithGreenEyes = db.Users.Where(u => u.EyeColor == "green");
var ages = List<int>();
foreach (var user in usersWithGreenEyes)
{
ages.Add(DateTime.Today.Year - user.DateOfBirth.Year);
}
Because Where()
does not execute a query, each iteration of the loop will fetch the in-scope user record from the database creating a database call per user.
In Entity Framework, this can be solved by ensuring the query is executed before operations on the data begin by calling a method that takes the IQueryable
returned by Where()
and produces an List
or Array
.
var usersWithGreenEyes = db.Users.Where(u => u.EyeColor == "green").ToList();
Now only one query will be executed and before the loop starts.
Be mindful of lazy loading
In Entity Framework, if lazy loading is turned on, seemingly benign references to child entities and child collections, could execute a query. For example:
var users = db.Users.ToList();
var distinctTraveledCountries = new HashSet<string>();
foreach (var user in users)
{
distinctTraveledCountries.UnionWith(user.TraveledCounties);
}
In the above example, we made sure the execute the query before entering the loop, but if lazy loading is turned on, each access to a child collection (like user.TraveledCountries
) will still execute a database query to get the child data.
This can also happen when using a mapper library. For example, if I have an array of user entities and I want to map that to an array of some type of user view that contains the same child collection of traveled countries.
IEnumerable<UserView> userViews = mapper.Map(userEntities);
The same lazy loading mechanism from the previous example will cause issues here as well since the mapper library will internally access each user in the array.
Often the solution for both of these cases is the same: eager-load the child collections and child entities, either explicitly or by mapping as part of the query.
// Eager loading explicitly
var usersWithChildren = db.Users
.Include(u => u.TraveledCountries)
.Include(u => u.AccountInfo)
.ToList();
The above will eager load both the TraveledCountries
collection as well as the child member AccountInfo
, so that those properties will be prefetched along with the rest of the User entity.
In Entity Framework this generates a single query which will include left joins on the child data. In some cases this could cause performance issues itself. One solution would be to use a split query. This looks almost the same but underneath it executes a query for the parent and a query for each included child. Instead of left joins, the child members are joined with an inner join.
// Eager loading explicitly as split query
var usersWithChildren = db.Users
.Include(u => u.TraveledCountries)
.Include(u => u.AccountInfo)
.AsSplitQuery()
.ToList();
The below will map any needed fields to a custom object type by only selecting the fields needed.
// Mapping in query
var userViews = db.Users
.Select(u => new UserView {
Id = u.Id,
Name = u.Name,
AccountInfo = u.AccountInfo
})
.ToList();
Use Paged Data and Work in Small Batches
You may find that while eager-loading data fixed your performance issue as far as the loop is concerned but it caused another performance issue with a query result set that is too large or a query that takes too long to run. In this case, it may be helpful to limit the scope of an operation. You can do this by implementing pagination. If the application is a web service, you can limit the amount of data returned and provide metadata for fetching the other pages. A response could look something like this:
{
"metadata" {
"page": 3,
"pageCount": 10,
"limit": 20,
"links": {
"self": "https://myapi.com/users?page=3&limit=20",
"first": "https://myapi.com/users?page=1&limit=20",
"previous": "https://myapi.com/users?page=2&limit=20",
"next": "https://myapi.com/users?page=4&limit=20",
"last": "https://myapi.com/users?page=10&limit=20",
}
},
"users": [
//...
]
}
If the application is an automated process, it can work in small batches instead of the entire dataset at once. Pagination for query execution in Entity Framework looks like this:
public IList<User> GetPagedUsers(int currPage, int pageSize)
{
return db.Users
.OrderBy(b => b.UserId)
.Skip(currPage * pageSize)
.Take(pageSize)
.ToList();
}
Identifying Before Production
The symptoms of the issue we are discussing usually don’t reveal themselves until applied at scale, so it can be easy for it to get past traditional automated testing and manual QA. Here are some ways to help identify it:
- Developer education. Learning about pitfalls so they can be addressed during development.
- Setting performance requirements associated with features. Whether or not you have an SLA you must adhere do, you can still include a threshold in the feature requirements before development begins.
- Code reviews
- Realistic testing against production-like environments (the earlier in the release process the better).
Need help with your software project?
My day job is working as a software engineer for Trendsic where I help businesses large and small plan and develop secure and scalable software. Reach out to me at contact@zachnology.io to discuss how I can help bring your ideas to life.
Top comments (0)