Batching best practices
This document gives an overview about the available batching strategies we use at GitLab. We list the pros and cons of each strategy so engineers can pick the ideal approach for their use case.
Why do we need batching
When dealing with a large volume of records, reading, updating or deleting the records in one database query can be challenging; the operation could easily time out. To avoid this problem, we should process the records in batches. Batching usually happens in background jobs, where runtime constraints are more relaxed than during web requests.
Use batching in background jobs and not in web requests
In rare cases (older features), batching also happens in web requests. However, for new features this is discouraged due to the short web request timeout (60 seconds by default). As a guideline, using background jobs (Sidekiq workers) should be considered as the first option when implementing a feature that needs to process a large volume of records.
Performance considerations
Batching performance is closely related to pagination performance since the underlying libraries and database queries are essentially the same. When implementing batching it's important to be familiar with the pagination performance guidelines and the documentation related to our batching utilities.
Batching in background jobs
There are two main aspects to consider when implementing batching in background jobs: total runtime and data modification volume.
Background jobs shouldn't run for a long time. A Sidekiq process can crash or it can be forcefully stopped (e.g. on restart or deployment). Additionally, due to our error budget rules, after 5 minutes of runtime, error budget violations will be added to the group where the feature is registered. When implementing batching in background jubs, make sure that you're familiar with our guidelines related to idempotent jobs
Updating or deleting a large volume of records can increase database replication lag and it can add extra strain to the primary database. It's advisable to limit the total number of records we process (or batch over) within the background job.
To address the potential issues mentioned above the following measures should be considered:
- Limit the total runtime for the job.
- Limit record modifications.
- Rest period between batches. (a few milliseconds)
When applying limits, it's important to mention that long-running background jobs should implement a "continue later" mechanism where a new job is scheduled after the limit is reached to continue the work where the batching was stopped. This is important when a job is so long that it's very likely that it won't fit into the 5 minutes runtime.
An example implementation of runtime limiting using the Gitlab::Metrics::RuntimeLimiter
class:
def perform(project_id)
runtime_limiter = Gitlab::Metrics::RuntimeLimiter.new(3.minutes)
project = Project.find(1)
project.issues.each_batch(of: :iid) do |scope|
scope.update_all(updated_at: Time.current)
break if runtime_limiter.over_time?
end
end
The batching in the code snippet stops when 3 minutes of runtime is reached. The problem now is that we have no way to continue the processing. To do that, we need to schedule a new background job with enough information to continue the processing. In the snippet, we batch issues within a project by the iid
column. For the next job, we need to provide the project ID and the last processed iid
values. This information we often call as the cursor.
def perform(project_id, iid = nil)
runtime_limiter = Gitlab::Metrics::RuntimeLimiter.new(3.minutes)
project = Project.find(project_id)
# Restore the previous iid if present
project.issues.where('iid > ?', iid || 0).each_batch(of: :iid) do |scope|
max_iid = scope.maximum(:iid)
scope.update_all(updated_at: Time.current)
if runtime_limiter.over_time?
MyJob.perform_in(2.minutes, project_id, iid)
break
end
end
end
Implementing a "continue later" mechanism can add significant complexity to the implementation. Hence, before committing to this work, analyze the existing data in the production database and try to extrapolate data growth. A few examples:
- Mark all
pending
todos for a given user asdone
does not need a "continue later" mechanism.- Reasoning: The number of pending todos will most likely not going to be over a few thousand database rows, even for the busiest users. Updating these rows would finish 99.9% of the time under 1 minute.
- Store CI build records in a CSV files within a given project might require a "continue later" mechanism.
- Reasoning: for very active projects, CI job count can grow at a very high rate into millions of rows.
When a very large volume of updates happen in the background job, it's advisable (not a strict requirement) to add some sleep to the code and limit the total number of records we update. This reduces the pressure on the primary databases and gives a small window for potential database migrations to acquire heavier locks.
def perform(project_id, iid = nil)
max_updates = 100_000 # Allow maximum N updates
updates = 0
status = :completed
runtime_limiter = Gitlab::Metrics::RuntimeLimiter.new(3.minutes)
project = Project.find(project_id)
project.issues.where('iid > ?', iid || 0).each_batch(of: :iid) do |scope|
max_iid = scope.maximum(:iid)
updates += scope.update_all(updated_at: Time.current)
if runtime_limiter.over_time? || updates >= max_updates
MyJob.perform_in(2.minutes, project_id, iid)
status = :limit_reached
break
end
# Adding sleep when we expect long running batching that modifies large volume of data
sleep 0.01
end
end
Traceability
For traceability purposes, it's a good practice to expose metrics so we can see how the batching performs in Kibana:
log_extra_metadata_on_done(:result, {
status: :limit_reached, # or :completed
updated_rows: updates
})
Scheduling of the next jobs
Scheduling the next job in the example above is not crash safe (the job can be lost), for very important tasks this approach is not suitable. A safe and common pattern is using a scheduled worker that executes the work based on a cursor. The cursor can be persisted in the DB or in Redis depending on the consistency requirements. This means that the cursor is no longer passed via the job arguments.
The frequency of the scheduled worker can be adjusted depending on the urgency of the task. We have examples when a scheduled worker is enqueued every minute to process urgent items.
Redis based cursor
Example: process all issues in a project.
def perform
project_id, iid = load_cursor # Load cursor from Redis
return unless project_id # Nothing was enqueued
project = Project.find(project_id)
project.issues.where('iid > ?', iid || 0).each_batch(of: :iid) do |scope|
# Do something with issues.
# Break here, set interrupted flag if time limit is up.
# Set iid to the last processed value.
end
# Continue the work later
push_cursor(project_id, iid) if interrupted?
end
private
def load_cursor
# Take 1 element, not crash safe.
raw_cursor = Gitlab::Redis::SharedState.with do |redis|
redis.lpop('my_cursor')
end
return unless raw_cursor
cursor = Gitlab::Json.parse(raw_cursor)
[cursor['project_id'], cursor['iid']]
end
def push_cursor(project_id, iid)
# Work is not finished, put the cursor at the beginning of the list so the next job can pick it up.
Gitlab::Redis::SharedState.with do |redis|
redis.lpush('my_cursor', Gitlab::Json.dump({ project_id: project_id, iid: iid }))
end
end
In the application code, you can put an item in the queue after the database transaction commits (see transaction guidelines for more details):
def execute
ApplicationRecord.transaction do
user.save!
Event.create!(user: user, issue: issue)
end
# Application could crash here
MyRedieQueue.add(user: user, issue: issue)
end
This approach is not crash-safe, the item would not be enqueued if the application crashes right after the transaction commits.
Pros:
- Easier to implement, no extra database table is needed for tracking the jobs.
- Good for low throughput, internally invoked jobs. (example: full-table periodical consistency checks, background aggregations)
Cons:
- Scheduling the work (putting the cursor in the queue) is not crash safe.
- Potential serialization issues when the cursor is read. (multi-version compability)
- Extra care needs to be taken about database transactions.
PostgreSQL based cursor
An alternative approach would be storing the queue in the PostgreSQL database. In this case, we can implement the transactional outbox pattern which ensures consistency in case of application (web or worker) crashes.
Pros:
- Scheduling the work can be made fully consistent with other record changes (example: schedule the work within the issue create transaction).
- Tolerates large number of items in the queue.
Cons:
- Depending on the volume, the implementation can be quite complex:
- Partitioned database table: this should be considered for high-throughput workers.
- Consider the sliding-window partitioning strategy.
- Complex, cross-partition queries.
Example: set up reliable way of sending emails
# In a service
def execute
ApplicationRecord.transaction do
user.save!
Event.create!(user: user, issue: issue)
IssueEmailWorkerQueue.insert!(user: user, issue: issue)
end
end
The IssueEmailWorkerQueue
record stores all necessary information for executing a job. In the scheduled background job we can process the table in a specific order.
def perform
runtime_limiter = Gitlab::Metrics::RuntimeLimiter.new(3.minutes)
items = EmailWorkerQueue.order(:id).take(25)
items.each do |item|
# Do something with the item
end
end
NOTE: To avoid parallel processing of records, you might need to wrap the execution with a distributed Redis lock.
Example Redis lock usage:
class MyJob
include ApplicationWorker
include Gitlab::ExclusiveLeaseHelpers
MAX_TTL = 2.5.minutes.to_i # It should be similar to the runtime limit.
def perform
in_lock('my_lock_key', ttl: MAX_TTL, retries: 0) do
# Do the work here.
end
end
end
Considerations for Sidekiq jobs
Sidekiq jobs can consume substantial database resources. If your job only batches over data but does not modify anything in the database, consider setting attributes favoring database replicas. See the documentation for the Sidekiq worker attributes.
Batching strategies
NOTE: To keep the examples easy to follow, we omit the code for limiting the runtime.
NOTE:
Some examples include an optional variable assignment to the cursor
variable. This is optional step which can be used when implementing the "continue later" mechanism.
Loop-based batching
The strategy leverages the fact that after updating or deleting records in the database, the exact same query will return different records. This strategy can only be used when we want to delete or update certain records.
Example:
loop do
# Requires an index on project_id
delete_count = project.issues.limit(1000).delete_all
break if delete_count == 0 # Exit the loop when there are not records to be deleted
end
Pros:
- Easy to implement, maintaining a cursor is not required.
- A single-column database index is sufficient to implement the batching which is often available (foreign keys).
- If order is not important, complex filter conditions can be also used as long as they're covered with an index.
Cons:
- Thorough testing and manual verification of the underlying
DELETE
orUPDATE
query is a must. There are some issues with CTEs when updating or deleting records. - If the
break
logic has a bug we might end up in an infinite loop.
It's possible to make the loop-based approach process records in a specific order:
loop do
# Requires a composite index on (project_id, created_at)
delete_count = project.issues.limit(1000).order(created_at: :desc).delete_all
break if delete_count == 0
end
With the index mentioned in the previous example, we can also use timestamp
conditions:
loop do
# Requires a composite index on (project_id, created_at)
delete_count = project
.issues
.where('created_at < ?', 1.month.ago)
.limit(1000)
.order(created_at: :desc)
.delete_all
break if delete_count == 0
end
Single-column batching
We can use a single, unique column (primary key or column which has a unique index) for batching with the EachBatch
module. This is one of the most commonly used batching strategy in GitLab.
# Requires a composite index on (project_id, id).
# EachBatch uses the primary key by default for the batching.
cursor = nil
project.issues.where('id > ?', cursor || 0).each_batch do |batch|
issues = batch.to_a
cursor = issues.last.id # For the next job
# do something with the issues records
end
Pros:
- The most popular way of batching within the GitLab application.
- Easy to implement, covers a wide range of use cases.
Cons:
- The
ORDER BY
column (ID) must be unique in the context of the query. - It does not work efficiently when
timestamp
column condition or other complex conditions (IN
,NOT EXISTS
) are present.
Batching over distinct values
EachBatch
requires a unique database column (usually the ID column) however, there are rare cases when the feature needs to batch over a non-unique column. Example: bump all project timestamp
values which have at least one issue.
One approach is to batch over the "parent" table, in this case using the Project
model.
cursor = nil
# Uses the primary key index
Project.where('id > ?', cursor || 0).each_batch do |batch|
cursor = batch.maximum(:id) # For the next job
project_ids = batch
.where('EXISTS (SELECT 1 FROM issues WHERE projects.id=issues.project_id)')
.pluck(:id)
Project.where(id: project_ids).update_all(update_all: Time.current)
end
Pros:
- When the column is a foreign key, batching the parent table's primary key should be already covered with an index.
Cons:
- Can be wasteful when the extra condition within the block would match only a small number of rows.
The batching query runs a full table scan over the projects
table which might be wasteful, alternatively, we can use the distinct_each_batch
helper method:
# requires an index on (project_id)
Issue.distinct_each_batch(column: :project_id) do |scope|
project_ids = scope.pluck(:project_id)
cursor = project_ids.last # For the next job
Project.where(id: project_ids).update_all(update_all: Time.current)
end
Pros:
- When the column is a foreign key column then index is already available.
- It can significantly reduce the amount of data the batching logic needs to scan.
Cons:
- Limited usage, not widely used.
Keyset-based batching
Keyset-based batching allows you to iterate over records in a specific order where multi-column sorting is also possible. The most common use cases are when we need to process data ordered via a timestamp
column.
Example: delete issue records older than one year.
def perform
cursor = load_cursor || {}
# Requires a composite index on (created_at, id) columns
scope = Issue.where('created_at > ?', 1.year.ago).order(:created_at, :id)
iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope, cursor: cursor)
iterator.each_batch(of: 100) do |records|
loaded_records = records.to_a
loaded_records.each { |record| record.destroy } # Calling destroy so callbacks are invoked
end
cursor = iterator.send(:cursor) # Store the cursor after this step, for the next job
end
With keyset-based batching, you could adjust the ORDER BY
clause to match the column configuration of an existing index. Consider the following index:
CREATE INDEX issues_search_index ON issues (project_id, state, created_at, id)
This index cannot be used by the snippet above because the ORDER BY
column list doesn't match exactly the column list in the index definition. However, if we alter the ORDER BY
clause then the index would be picked up by the query planner:
# Note: this is a different sort order but at least we can use an existing index
scope = Issue.where('created_at > ?', 1.year.ago).order(:project_id, :state, :created_at, :id)
Pros:
- Multi-column sort orders and more complex filtering are possible.
- You might be able to reuse existing indexes without introducing new ones.
Cons:
- Cursor size could be larger (each
ORDER BY
column will be stored in the cursor).
Offset batching
This batching technique uses offset pagination when loading new records. Offset pagination should be used only as a last resort when the given query cannot be paginated via EachBatch
or via keyset-pagination. One reason for choosing this technique is when there is no suitable index available for the SQL query to use a different batching technique. Example: in a background job we load too many records without limit and it started to time out. The order of the records are important.
def perform(project_id)
# We have a composite index on (project_id, created_at) columns
issues = Issue
.where(project_id: project_id)
.order(:created_at)
.to_a
# do something with the issues
end
As the number of issues within the project grows, the query gets slower and eventually times out. Using a different batching technique such as keyset-pagination is not possible because the ORDER BY
clause is depending on a timestamp
column which is not unique (see the tie-breaker section). Ideally, we should order on the created_at, id
columns, however we don't have that index available. In a time-sensitive scenario (such as an incident) it might not be feasible to introduce a new index right away so as a last resort we can attempt offset pagination.
def perform(project_id)
page = 1
loop do
issues = Issue.where(project_id: project_id).order(:created_at).page(page).to_a
page +=1
break if issues.empty?
# do something with the issues
end
end
The snippet above can be a short term fix until a proper solution is in place. It's important to note that offset pagination gets slower as the page number increases which means that there might be a chance where the offset paginated query times out the same way as the original query. The chances are reduced to some extent by the database buffer cache which keeps the previously loaded records in memory; Thus, the consecutive (short-term) lookup of the same rows will not have very high impact on the performance.
Pros:
- Easy to implement.
Cons:
- Performance degrades linearly as the page number is increased.
- This is only a stop-gap measure which shouldn't be used for new features.
- You can store the page number as the cursor but restoring the processing from the previous point can be unreliable.
Batching over the Group hierarchy
We have several features where we need to query data in the top-level namespace and its subgroups. There are outlier group hierarchies which contain several thousand subgroups or projects. Querying such hierarchies can easily lead to database statement timeouts when additional subqueries or joins are added.
Example: iterate over issues in a group
group = Group.find(9970)
Issue.where(project_id: group.all_project_ids).each_batch do |scope|
# Do something with issues
end
The example above will load all subgroups, all projects and all issues in the group hierarchy which will very likely lead to database statement timeout. The query above can be slightly improved with database indexes as a short-term solution.
Using the in-operator optimization
When you need to process records in a specific order in a group, you can use the in-operator optimization which can provide better performance than using a standard each_batch
based batching strategy.
You can see an example for batching over records in the group hierarchy here.
Pros:
- This is the only way to batch over records efficiently within the group hierarchy in a specific order.
Cons:
- Requires more complex setup.
- Batching over very large hierarchies (high number of projects or subgroups) will require lower batch size.
Always batch from the top-level group
This technique can be used when we always have to batch from the top level group (group without parent group). In this case we can leverage the following index in the namespaces
table:
"index_on_namespaces_namespaces_by_top_level_namespace" btree ((traversal_ids[1]), type, id) -- traversal_ids[1] is the top-level group id
Example batching query:
Namespace.where('traversal_ids[1] = ?', 9970).where(type: 'Project').each_batch do |project_namespaces|
project_ids = Project.where(project_namespace_id: project_namespaces.select(:id)).pluck(:id)
cursor = project_namespaces.last.id # For the next job
project_ids.each do |project_id|
Issue.where(project_id: project_id).each_batch(column: :iid) do |issues|
# do something with the issues
end
end
end
Pros:
- Loading the whole group hierarchy can be avoided.
- Processing evenly distributed batches using a nested
EachBatch
.
Cons:
- More database queries due to the double batching.
Batch from any node from the group hierarchy
Using the NamespaceEachBatch
class allows us to batch a specific branch of the group hierarchy (tree).
# current_id: id of the namespace record where we iterate from
# depth: depth of the tree where the iteration was stopped previously. Initially, it should be the same as the current_id
cursor = { current_id: 9970, depth: [9970] } # This can be any namespace id
iterator = Gitlab::Database::NamespaceEachBatch.new(namespace_class: Namespace, cursor: cursor)
# Requires a composite index on (parent_id, id) columns
iterator.each_batch(of: 100) do |ids, new_cursor|
namespace_ids = Namespaces::ProjectNamespace.where(id: ids)
cursor = new_cursor # For the next job, contains the new current_id and depth values
project_ids = Project.where(project_namespace_id: namespace_ids)
project_ids.each do |project_id|
Issue.where(project_id: project_id).each_batch(column: :iid) do |issues|
# do something with the issues
end
end
end
Pros:
- It can process the group hierarchy from any node.
Cons:
- Rarely used, useful in only very rare use cases.
Batching over complex queries
We consider complex queries where the query contains multiple filters and joins. Most of the time these queries cannot be batched easily. A few examples:
- Use
JOIN
to filter out rows. - Use subqueries.
- Use multiple
IN
filters or complexAND
orOR
conditions.