Add new relations to the direct transfer importer

At a high level, to add a new relation to the direct transfer importer, you must:

Add a new relation to the list of exported data.
Add a new ETL (Extract/Transform/Load) Pipeline on the import side with data processing instructions.
Add the newly created pipeline to the list of importing stages.
Ensure sufficient test coverage.

Export from source

There are a few types of relations we export:

ActiveRecord associations. Read from the import_export.yml file, serialized to JSON, and written to an NDJSON file. Each relation is exported to either a .gz file, or a .tar.gz file if it is a collection, then uploaded and served through the REST API for the destination instance of GitLab to download and import.
Binary files. For example, uploads or LFS objects.
A handful of relations that are not exported but are read from the GraphQL API directly during import.

For ActiveRecord associations, you should use NDJSON rather than the GraphQL API for performance reasons. Heavily nested associations can produce a lot of network requests, which can slow down the overall migration.
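
As a rough illustration of the NDJSON format itself (not GitLab's exporter), the following sketch serializes a few records as one JSON document per line and gzips the result. The `to_ndjson_gz` helper and the sample rows are hypothetical names and data invented for this example.

```ruby
require 'json'
require 'zlib'

# Hypothetical helper: one JSON document per line, then gzip the payload.
# This only illustrates the NDJSON shape, not the real export service.
def to_ndjson_gz(records)
  ndjson = records.map { |record| JSON.generate(record) }.join("\n")
  Zlib.gzip(ndjson)
end

# Made-up rows standing in for an exported `documents` relation.
records = [
  { 'title' => 'Design doc', 'description' => 'First draft' },
  { 'title' => 'Runbook', 'description' => 'On-call steps' }
]

compressed = to_ndjson_gz(records)

# Reading it back: decompress and split on newlines, one record per line.
lines = Zlib.gunzip(compressed).lines(chomp: true)
```

Because every line is a complete JSON document, a consumer can process the file one record at a time instead of parsing one large array.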

Exporting an ActiveRecord relation

The direct transfer importer's underlying behavior is heavily based on the file-based importer, which uses the import_export.yml file to describe the list of Project associations to include in the export. A similar import_export.yml is available for Group.

For example, let's say we have a new Project association called documents. To add support for importing that new association, we must:

Add it to the import_export.yml file.
Add test coverage for the new relation.
Verify that the added relation is exporting as expected.

Add it to `import_export.yml` file

NOTE: Associations listed in this file are imported from top to bottom. If you have an association that is order-dependent, put the dependencies before the associations that require them. For example, documents must be imported before merge requests, otherwise they are not valid.

Add your association to tree.project within the import_export.yml.

diff --git a/lib/gitlab/import_export/project/import_export.yml b/lib/gitlab/import_export/project/import_export.yml
index 43d66e0e67b7..0880a27dfce2 100644
--- a/lib/gitlab/import_export/project/import_export.yml
+++ b/lib/gitlab/import_export/project/import_export.yml
@@ -122,6 +122,7 @@ tree:
         - label:
           - :priorities
     - :service_desk_setting
+    - :documents
   group_members:
     - :user

NOTE: If your association relates to an Enterprise Edition-only feature, add it to the ee.tree.project tree at the end of the file so that it is only exported and imported in Enterprise Edition instances of GitLab.

If your association doesn't need to include any sub-relations, then this is enough. But if it needs more sub-relations to be included (for example, notes), you must list them out. Let's say documents can have notes (with award emojis on notes) and award emojis (on documents), which we want to migrate. In this case, our relation becomes the following:

diff --git a/lib/gitlab/import_export/project/import_export.yml b/lib/gitlab/import_export/project/import_export.yml
index 43d66e0e67b7..0880a27dfce2 100644
--- a/lib/gitlab/import_export/project/import_export.yml
+++ b/lib/gitlab/import_export/project/import_export.yml
@@ -122,6 +122,10 @@ tree:
         - label:
           - :priorities
     - :service_desk_setting
+    - documents:
+      - :award_emoji
+      - notes:
+        - :award_emoji
   group_members:
     - :user
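
To make the nesting concrete, here is a minimal sketch that walks a tree of the same shape and lists every relation path it implies. The tree is written as a Ruby literal that mirrors the YAML above, and `relation_paths` is a hypothetical helper, not part of the importer.

```ruby
# Ruby literal mirroring the YAML shape: a symbol is a leaf relation,
# a single-key hash is a relation with sub-relations.
TREE = [
  :service_desk_setting,
  { documents: [:award_emoji, { notes: [:award_emoji] }] }
].freeze

# Hypothetical helper: list every relation path, parents before children.
def relation_paths(nodes, prefix = nil)
  nodes.flat_map do |node|
    if node.is_a?(Hash)
      node.flat_map do |name, children|
        path = [prefix, name].compact.join('.')
        [path] + relation_paths(children, path)
      end
    else
      [[prefix, node].compact.join('.')]
    end
  end
end

paths = relation_paths(TREE)
```

For this tree, `paths` contains `documents`, `documents.award_emoji`, `documents.notes`, and `documents.notes.award_emoji`, in addition to `service_desk_setting`.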

Add included_attributes of the relation. By default, any relation attribute that is not listed in included_attributes of the YAML file is filtered out on both export and import. To include the attributes you need, add them to the included_attributes list as follows:

diff --git a/lib/gitlab/import_export/project/import_export.yml b/lib/gitlab/import_export/project/import_export.yml
index 43d66e0e67b7..dbf0e1275ecf 100644
--- a/lib/gitlab/import_export/project/import_export.yml
+++ b/lib/gitlab/import_export/project/import_export.yml
@@ -142,6 +142,9 @@ import_only_tree:

 # Only include the following attributes for the models specified.
 included_attributes:
+  documents:
+    - :title
+    - :description
   user:
     - :id
     - :public_email
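
The effect of an allowlist like this can be sketched in a few lines. This is an illustration only: `INCLUDED_ATTRIBUTES` and `filter_attributes` are hypothetical names, and treating a relation with no entry as having no allowed attributes is an assumption made for this sketch.

```ruby
# Hypothetical allowlist keyed by relation name, shaped like `included_attributes`.
INCLUDED_ATTRIBUTES = {
  'documents' => %w[title description]
}.freeze

# Keep only allowlisted keys. Assumption for this sketch: a relation
# without an entry has no allowed attributes.
def filter_attributes(relation, attributes)
  allowed = INCLUDED_ATTRIBUTES.fetch(relation, [])
  attributes.slice(*allowed)
end

row = {
  'title' => 'Design doc',
  'description' => 'First draft',
  'internal_notes' => 'not listed, so dropped'
}

filtered = filter_attributes('documents', row)
```

Here `filtered` keeps `title` and `description` and drops `internal_notes`, which is the behavior to expect for any attribute you forget to list.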

Add excluded_attributes of the relation. The file also contains an excluded_attributes list. You don't need to add excluded attributes for Project, but you do still need to do it for Group. This list represents attributes that should not be included in the export and should be ignored on import. These attributes usually are:
- Anything that ends in _id or _ids
- Anything whose name includes attributes (except custom_attributes)
- Anything that ends in _html
- Anything sensitive (for example, tokens or encrypted data)
See a full list of prohibited references here.
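
The naming rules above can be approximated with a small filter. This is a sketch under stated assumptions: the patterns are modeled on the bullet list only, the `token` and `encrypted_` checks are just examples of sensitive names, and `prohibited?` and `clean` are hypothetical helpers rather than the importer's own code.

```ruby
# Hypothetical patterns modeled on the bullet list above; the real rules
# live in the GitLab codebase and may differ.
PROHIBITED_PATTERNS = [
  /_ids?\z/,      # ends in _id or _ids
  /_html\z/,      # ends in _html
  /token/,        # example of a sensitive name
  /\Aencrypted_/  # another example of a sensitive name
].freeze

ALLOWED_EXCEPTIONS = %w[custom_attributes].freeze

def prohibited?(key)
  return false if ALLOWED_EXCEPTIONS.include?(key)

  key.include?('attributes') || PROHIBITED_PATTERNS.any? { |pattern| pattern.match?(key) }
end

# Drop every key that matches a prohibited pattern.
def clean(attributes)
  attributes.reject { |key, _value| prohibited?(key) }
end

cleaned = clean({
  'title' => 'Design doc',
  'author_id' => 7,
  'description_html' => '<p>x</p>',
  'runners_token' => 'abc',
  'nested_attributes' => {},
  'custom_attributes' => []
})
```

Only `title` and `custom_attributes` survive in `cleaned`.
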
Add methods of the relation. If your relation has a method (for example, document.signature) whose return value must also be exported, add it to the methods section. The exported value is present in the export, and you can use it on import, for example by assigning it to a field.

For example, we export the return value of the note_diff_file.diff_export method and, on import, set note_diff_file.diff to the exported value.
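
The round trip for an exported method can be sketched as follows. Everything here is hypothetical: the `Document` and `ImportedDocument` structs, the digest-based `signature`, and the `export_row` and `import_row` helpers exist only to show a computed value being written on export and assigned to a plain field on import.

```ruby
require 'digest'
require 'json'

# Hypothetical model: `signature` is computed, not stored.
Document = Struct.new(:title) do
  def signature
    Digest::SHA256.hexdigest(title)
  end
end

# Hypothetical import-side record with a plain field to receive the value.
ImportedDocument = Struct.new(:title, :signature)

# Export side: attributes plus the return value of each listed method.
def export_row(record, methods)
  row = { 'title' => record.title }
  methods.each { |name| row[name.to_s] = record.public_send(name) }
  row
end

# Import side: assign the exported method value to a field.
def import_row(row)
  ImportedDocument.new(row.fetch('title'), row.fetch('signature'))
end

exported = JSON.generate(export_row(Document.new('Design doc'), [:signature]))
imported = import_row(JSON.parse(exported))
```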

Add test coverage for new relation

Because the direct transfer uses the file-based importer under the hood, we must add test coverage for a new relation with tests in the scope of the file-based importer, which also covers the export side of the direct transfer importer. Add tests to:

spec/lib/gitlab/import_export/project/tree_saver_spec.rb. A similar file is available for Group.
ee/spec/lib/ee/gitlab/import_export/project/tree_saver_spec.rb for EE-specific relations.

Follow the examples of other relations when adding the new tests.

Verifying added relation is exporting as expected

Any newly-added relation specified in import_export.yml is automatically added to the export files written on disk, so no extra actions are required.

After the relation and its tests are added, we can manually check that the relation is exported. It should automatically be included in both:

File-based imports and exports. Use the project export functionality to export, download, and inspect the exported data.
Direct transfer exports. Use the export_relations API to export, download, and inspect exported relations (it might be exported in batches).

Export a binary relation

If adding support for a binary relation:

Create a new export service that performs the export on disk. For an example, see BulkImports::LfsObjectsExportService.
Add the relation to the list of file_relations.
Add the relation to BulkImports::FileExportService.

Example

Import on destination

As mentioned above, there are three kinds of relations in direct transfer imports:

NDJSON-exported relations, downloaded from the export_relations API. For example, documents.ndjson.gz.
GraphQL API relations. For example, members information is fetched using GraphQL to import group and project user memberships.
Binary relations, downloaded from the export_relations API. For example, lfs_objects.tar.gz.

Because the direct transfer importer is based on the Extract/Transform/Load data processing technique, to start importing a relation we must define:

A new relation importing pipeline. For example, DocumentsPipeline.
A data extractor for the pipeline to know where and how to extract the data. For example, NdjsonExtractor.
A list of transformers, which is a set of classes that are going to transform the data to the format you need.
A loader, which is going to persist data somewhere. For example, save a row in the database or create a new LFS object.

No matter what type of relation is being imported, the Pipeline class structure is the same:

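module BulkImports
  module Common
    module Pipelines
      class DocumentsPipeline
        include Pipeline

        # Extract: return the raw entries to process, wrapped in ExtractedData.
        # `file_paths` is a placeholder for however the pipeline locates its data.
        def extract(context)
          BulkImports::Pipeline::ExtractedData.new(data: file_paths)
        end

        # Transform: convert a single extracted entry into the form to be loaded.
        def transform(context, object)
          ...
        end

        # Load: persist a single transformed entry.
        # `document` is a placeholder for the record built from `object`.
        def load(context, object)
          document.save!
        end
      end
    end
  end
end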
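
To show how the three steps fit together, here is a self-contained toy version with a minimal runner. It is a sketch, not the importer's runner: `ToyDocumentsPipeline`, `run_pipeline`, the hash used as context, and the in-memory array standing in for the database are all assumptions made for illustration.

```ruby
# Toy pipeline with the same three steps; all names are hypothetical.
class ToyDocumentsPipeline
  attr_reader :saved

  def initialize(raw_rows)
    @raw_rows = raw_rows
    @saved = []
  end

  # Extract: return the raw entries to process.
  def extract(_context)
    @raw_rows
  end

  # Transform: normalize one entry and scope it to the project in the context.
  def transform(context, row)
    { title: row.fetch('title').strip, project_id: context.fetch(:project_id) }
  end

  # Load: persist one transformed entry (an array stands in for the database).
  def load(_context, document)
    @saved << document
  end
end

# Minimal runner: extract once, then transform and load each entry in turn.
def run_pipeline(pipeline, context)
  pipeline.extract(context).each do |entry|
    pipeline.load(context, pipeline.transform(context, entry))
  end
end

pipeline = ToyDocumentsPipeline.new([{ 'title' => ' Design doc ' }, { 'title' => 'Runbook' }])
run_pipeline(pipeline, { project_id: 42 })
```

Keeping each step a separate method means a pipeline only has to describe what is specific to its relation, while the surrounding loop stays the same.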

Importing a relation from NDJSON

Defining a pipeline

From the previous example, our documents relation is exported to an NDJSON file, in which case we can use both:

NdjsonPipeline, which includes automatic data transformation from JSON to an ActiveRecord object (it uses the file-based importer under the hood).
NdjsonExtractor, which downloads the .ndjson.gz file from the source instance using the /export_relations/download REST API endpoint.

Each step of the ETL pipeline can be defined as a method or a class.

class DocumentsPipeline
  include NdjsonPipeline

  relation_name 'documents'

  extractor ::BulkImports::Common::Extractors::NdjsonExtractor, relation: relation
end

This new pipeline will now:

Download the documents.ndjson.gz file from the source instance.
Read the contents of the NDJSON file and deserialize JSON to convert to an ActiveRecord object.
Save it in the database in scope of a project.
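
The read side of those steps can be sketched without any GitLab classes. In this illustration the "download" is an in-memory gzip payload, `each_ndjson_row` is a hypothetical helper, and attaching a `project_id` to each row stands in for saving the record in the scope of a project.

```ruby
require 'json'
require 'stringio'
require 'zlib'

# Stand-in for the downloaded documents.ndjson.gz; no network call is made.
downloaded = Zlib.gzip(%({"title":"Design doc"}\n{"title":"Runbook"}\n))

# Stream the archive line by line and yield one parsed row at a time.
def each_ndjson_row(gz_bytes)
  return enum_for(:each_ndjson_row, gz_bytes) unless block_given?

  Zlib::GzipReader.wrap(StringIO.new(gz_bytes)) do |gz|
    gz.each_line { |line| yield JSON.parse(line) unless line.strip.empty? }
  end
end

project_id = 42 # hypothetical project scope

# "Save" each row by tagging it with the project it belongs to.
saved = each_ndjson_row(downloaded).map { |row| row.merge('project_id' => project_id) }
```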

A pipeline can be placed under either:

The BulkImports::Common::Pipelines namespace if it's shared between Group and Project migrations. For example, LabelsPipeline is a common pipeline and is referenced in both Group and Project stage lists.
The BulkImports::Projects::Pipelines namespace if a pipeline belongs to a Project migration.
The BulkImports::Groups::Pipelines namespace if a pipeline belongs to a Group migration.

Adding a new pipeline to stages

The direct transfer importer performs migration of groups and projects in stages. The list of stages is defined in:

For Project: lib/bulk_imports/projects/stage.rb.
For Group: lib/bulk_imports/groups/stage.rb.

Each stage:

Can have multiple pipelines that run in parallel.
Must fully complete before moving to the next stage.

Let's add our pipeline to the Project stage:

module BulkImports
  module Projects
    class Stage < ::BulkImports::Stage
      private

      def config
        {
          project: {
            pipeline: BulkImports::Projects::Pipelines::ProjectPipeline,
            stage: 0
          },
          repository: {
            pipeline: BulkImports::Projects::Pipelines::RepositoryPipeline,
            maximum_source_version: '15.0.0',
            stage: 1
          },
          documents: {
            pipeline: BulkImports::Projects::Pipelines::DocumentsPipeline,
            minimum_source_version: '16.11.0',
            stage: 2
          }
        }
      end
    end
  end
end

We specified:

stage: 2, so the project and repository stages must complete before our pipeline runs in stage 2.
minimum_source_version: '16.11.0'. Because we introduced the documents relation for exports in this milestone, it's not available in previous GitLab versions. Therefore, this pipeline only runs if the source version is 16.11 or later.

NOTE: If a relation is deprecated and its pipeline needs to run only up to a certain version, we can specify the maximum_source_version attribute.
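
The interaction between stage numbers and version bounds can be sketched as follows. This is not the importer's logic: `runnable?` and `stages_for` are hypothetical helpers, pipelines are represented by plain strings, and treating both bounds as inclusive full-version comparisons is an assumption.

```ruby
require 'rubygems' # Gem::Version, normally loaded by default

CONFIG = {
  project: { pipeline: 'ProjectPipeline', stage: 0 },
  repository: { pipeline: 'RepositoryPipeline', maximum_source_version: '15.0.0', stage: 1 },
  documents: { pipeline: 'DocumentsPipeline', minimum_source_version: '16.11.0', stage: 2 }
}.freeze

# Assumption: both bounds are inclusive.
def runnable?(entry, source_version)
  version = Gem::Version.new(source_version)
  min = entry[:minimum_source_version]
  max = entry[:maximum_source_version]

  (min.nil? || version >= Gem::Version.new(min)) &&
    (max.nil? || version <= Gem::Version.new(max))
end

# Group the runnable pipelines by stage, lowest stage first.
def stages_for(source_version)
  CONFIG
    .select { |_name, entry| runnable?(entry, source_version) }
    .group_by { |_name, entry| entry[:stage] }
    .sort_by { |stage, _entries| stage }
    .map { |stage, entries| [stage, entries.map(&:first)] }
end

new_source = stages_for('16.11.0')
old_source = stages_for('14.9.0')
```

With a 16.11.0 source, the documents pipeline is scheduled and the repository pipeline is skipped; with a 14.9.0 source it is the other way around.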

Covering a pipeline with tests

Because we already covered the export side with tests, we must do the same for the import side. For the direct transfer importer, each pipeline has a separate spec file that would look something like this example.

Example

Importing a relation from GraphQL API

If your relation is available through the GraphQL API, you can use GraphqlExtractor and perform transformations and loading within the pipeline class.

MembersPipeline example:

module BulkImports
  module Common
    module Pipelines
      class MembersPipeline
        include Pipeline

        transformer Common::Transformers::ProhibitedAttributesTransformer
        transformer Common::Transformers::MemberAttributesTransformer

        def extract(context)
          graphql_extractor.extract(context)
        end

        def load(_context, data)
          ...

          member.save!
        end

        private

        def graphql_extractor
          @graphql_extractor ||= BulkImports::Common::Extractors::GraphqlExtractor
            .new(query: BulkImports::Common::Graphql::GetMembersQuery)
        end
      end
    end
  end
end

The rest of the steps are identical to the steps above.
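
One way to picture the transformer declarations in MembersPipeline is as a chain in which each transformer receives the previous one's output. The sketch below assumes that transformers run in declaration order and expose a `transform(context, data)` method; the two transformer classes and `apply_transformers` are hypothetical and far simpler than the real ones.

```ruby
# Hypothetical transformers; each takes (context, data) and returns new data.
class DropIdKeys
  def transform(_context, data)
    data.reject { |key, _value| key.end_with?('_id') }
  end
end

class NormalizeAccessLevel
  def transform(_context, data)
    data.merge('access_level' => Integer(data.fetch('access_level')))
  end
end

# Apply transformers in order, feeding each result into the next.
def apply_transformers(transformers, context, entry)
  transformers.reduce(entry) { |data, transformer| transformer.transform(context, data) }
end

member = apply_transformers(
  [DropIdKeys.new, NormalizeAccessLevel.new],
  {},
  { 'user_id' => 1, 'access_level' => '30' }
)
```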

Import a binary relation

A binary relation pipeline has the same structure as other pipelines. All you need to do is define what happens during the extract, transform, and load steps.

LfsObjectsPipeline example:

module BulkImports
  module Common
    module Pipelines
      class LfsObjectsPipeline
        include Pipeline

        file_extraction_pipeline!

        def extract(_context)
          download_service.execute
          decompression_service.execute
          extraction_service.execute

          ...
        end

        def load(_context, file_path)
          ...

          lfs_object.save!
        end
      end
    end
  end
end

There are a number of helper service classes to assist with data download:

BulkImports::FileDownloadService: Downloads a file from a given location.
BulkImports::FileDecompressionService: Gzip decompression service with required validations.
BulkImports::ArchiveExtractionService: Tar extraction service.
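
The decompress-then-extract sequence can be illustrated with the Ruby standard library alone. This sketch is not how the services above are implemented and performs none of their validations; `build_tar_gz` and `extract_tar_gz` are hypothetical helpers, and the archive is built in memory to stand in for a downloaded file.

```ruby
require 'rubygems/package'
require 'stringio'
require 'zlib'

# Build a small .tar.gz in memory to stand in for a downloaded archive.
def build_tar_gz(files)
  tar_io = StringIO.new(String.new)
  Gem::Package::TarWriter.new(tar_io) do |tar|
    files.each do |name, content|
      tar.add_file_simple(name, 0o644, content.bytesize) { |entry| entry.write(content) }
    end
  end
  Zlib.gzip(tar_io.string)
end

# Decompress, then read every regular file out of the tar stream.
def extract_tar_gz(bytes)
  tar_bytes = Zlib.gunzip(bytes)
  extracted = {}
  Gem::Package::TarReader.new(StringIO.new(tar_bytes)) do |tar|
    tar.each do |entry|
      extracted[entry.full_name] = entry.read if entry.file?
    end
  end
  extracted
end

archive = build_tar_gz('lfs_objects/abc123' => 'binary payload')
files = extract_tar_gz(archive)
```

A production implementation would also need to guard against oversized archives and unsafe entry paths, which this sketch deliberately leaves out.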