# Principles of Importer Design
## Security
- Uploaded files must be validated. Examples:
## Logging
- Logs should contain the importer type, such as `github`, `bitbucket`, or `bitbucket_server`. You can find a full list of import sources in `Gitlab::ImportSources`.
- Logs should include any information likely to aid in debugging:
  - Object identifiers such as `id`, `iid`, and the type of object
  - Error or status messages
- Logs should not include sensitive or private information, including but not limited to:
  - Usernames
  - Email addresses
- Where applicable, we should track the error in `Gitlab::Import::ImportFailureService` to aid in displaying errors in the UI.
- Logging should raise an error in development if key identifiers are missing, as demonstrated in this MR.
- A log line should be created before and after each record is imported, containing that record's identifier (see the sketch after this list).
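
For illustration, a minimal sketch of these points, assuming a hypothetical `IssueImporter` class (the `log_info` helper and field names are illustrative; `Gitlab::Import::Logger` is GitLab's structured import logger):

```ruby
# Hypothetical importer demonstrating the logging guidance above.
class IssueImporter
  def initialize(project)
    @project = project
  end

  def execute(issue)
    # Log before and after each record, with its identifiers.
    log_info(message: 'starting issue import', iid: issue[:iid])

    # ... persist the record ...

    log_info(message: 'issue imported', iid: issue[:iid])
  end

  private

  # Every line carries the importer type and project identifier.
  def log_info(attributes)
    ::Gitlab::Import::Logger.info(
      {
        import_type: 'github', # one of Gitlab::ImportSources
        project_id: @project.id
      }.merge(attributes)
    )
  end
end
```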
## Performance
- A cache with a default TTL of 24 hours should be used to prevent duplicate database queries and API calls (see the first sketch after this list).
- Workers that loop over collections should be equipped with a progress pointer that allows them to pick up where they left off if interrupted (see the second sketch after this list).
- Write-heavy workers should implement `defer_on_database_health_signal` to avoid saturating the database. However, at the time of writing, a known issue prevents us from using this.
- We should enforce limits on worker concurrency to avoid saturating resources. You can find an example of this in the Bitbucket `ParallelScheduling` class.
- Importers should be tested at scale on a staging environment, especially when implementing new functionality or enabling a feature flag.
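
As a sketch of the caching point, assuming GitLab's `Gitlab::Cache::Import::Caching` helper, which defaults to a 24-hour TTL (the key name and `fetch_user_id_from_api` are illustrative):

```ruby
# Sketch: avoid repeating the same API call for every imported record.
def user_id_for(username)
  cache_key = "my_importer/user_ids/#{username}" # illustrative key

  cached = Gitlab::Cache::Import::Caching.read(cache_key)
  return cached.to_i if cached

  user_id = fetch_user_id_from_api(username) # hypothetical API call

  # Written with the helper's default TTL of 24 hours.
  Gitlab::Cache::Import::Caching.write(cache_key, user_id)
  user_id
end
```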
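
And a sketch of a progress pointer, so a re-enqueued worker resumes rather than restarting (the `@client` interface and key name are hypothetical):

```ruby
# Sketch: persist a cursor after each record so an interrupted worker
# can pick up where it left off.
def import_issues
  pointer_key = "my_importer/#{@project.id}/issues/pointer"
  last_iid = (Gitlab::Cache::Import::Caching.read(pointer_key) || 0).to_i

  @client.each_issue(since_iid: last_iid) do |issue| # hypothetical client
    import_issue(issue)
    Gitlab::Cache::Import::Caching.write(pointer_key, issue[:iid])
  end
end
```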
## Resilience
- Workers should be idempotent so they can be retried safely in the case of failure (see the worker sketch after this list).
- Workers should be re-enqueued with a delay that respects concurrent batch limits.
- Individual workers should not run for a long time. Workers that run for a long time can be interrupted by Sidekiq due to a deploy, or be misidentified by `StuckProjectImportJobsWorker` as being part of an import that is stuck and should be failed.
  - If a worker must run for a long time, it must refresh its JID using `Gitlab::Import::RefreshImportJidWorker` to avoid being terminated by `StuckProjectImportJobsWorker`. It may also need to raise its Sidekiq `max_retries_after_interruption`. Refer to the GitHub importer implementation.
- Workers that rely on cached values must implement fall-back mechanisms to fetch data in the event of a cache miss (see the fallback sketch after this list):
  - Re-fetch data if possible and performant.
  - Gracefully handle missing values.
- Long-running workers should be annotated with `worker_resource_boundary :memory` to place them on a shard with a two-hour termination grace period. A long termination grace period is not a replacement for writing fast workers. Apdex SLO compliance can be monitored on the I&I team Grafana dashboard.
- Workers that create data should not fail an entire import if a single record fails to import. They must log the appropriate error and decide whether or not to retry based on the nature of the error.
- Import Stage workers (which include `StageMethods`) and Advance Stage workers (which include `Gitlab::Import::AdvanceStage`) should have `retries: 6` to make them more resilient to system interruptions. With exponential back-off, six retries span approximately 20 minutes. Any higher retry count holds up an import for too long.
- It should be possible to retry a portion of an import, for example re-importing missing issues without overwriting the entire destination project.
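
To make several of these points concrete, here is a minimal worker sketch. `ImportIssueWorker`, `over_concurrency_limit?`, and `import_issue` are hypothetical; `idempotent!`, `sidekiq_options` (whose retry count is expressed as Sidekiq's `retry` option), `perform_in`, and `Gitlab::Import::ImportFailureService` exist in GitLab:

```ruby
# Hypothetical worker illustrating idempotency, delayed re-enqueueing,
# bounded retries, and per-record error handling.
class ImportIssueWorker
  include ApplicationWorker

  # Safe to retry: performing twice with the same arguments has the
  # same effect as performing once.
  idempotent!

  # Sidekiq retry option; the guidance above recommends six retries.
  sidekiq_options retry: 6

  def perform(project_id, issue_attributes)
    # Respect concurrent batch limits by deferring with a delay.
    if over_concurrency_limit?(project_id) # hypothetical check
      self.class.perform_in(30.seconds, project_id, issue_attributes)
      return
    end

    import_issue(project_id, issue_attributes) # hypothetical
  rescue ActiveRecord::RecordInvalid => e
    # A single bad record must not fail the whole import: track the
    # error for the UI and let the import continue.
    Gitlab::Import::ImportFailureService.track(
      project_id: project_id,
      exception: e,
      fail_import: false
    )
  end
end
```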
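
And a sketch of a cache-miss fallback, reusing the hypothetical user-mapping cache from the Performance section:

```ruby
# Sketch: never assume a cached value is present. Fall back to the
# source of truth, then degrade gracefully if the data is gone.
def author_id_for(issue)
  cache_key = "my_importer/user_ids/#{issue[:author]}"

  cached = Gitlab::Cache::Import::Caching.read(cache_key)
  return cached.to_i if cached

  # Cache miss: re-fetch if possible and performant.
  user_id = fetch_user_id_from_api(issue[:author]) # hypothetical
  return user_id if user_id

  # Gracefully handle the missing value, for example by attributing
  # the record to the user who started the import.
  @project.creator_id
end
```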
## Consistency
- Importers should fire callbacks after saving records. Problematic callbacks can be disabled for imports on an individual basis (see the sketch after this list):
  - Include the `Importable` module.
  - Configure the callback to skip if `importing?`.
  - Set the `importing` value on the object under import.
- If records must be inserted in bulk, consider manually running callbacks.
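
A sketch of this pattern; the model and the `notify_subscribers` callback are illustrative, while `Importable` provides the `importing` accessor and `importing?` predicate:

```ruby
class Issue < ApplicationRecord
  include Importable

  # Skipped during imports because the object is flagged as importing.
  after_save :notify_subscribers, unless: :importing?
end

# In the importer, flag the object before saving it:
issue = Issue.new(issue_attributes)
issue.importing = true
issue.save!
```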