Bulk import for efficient importing of data from the client into the database #27333

Open · roji opened this issue Feb 1, 2022 · 9 comments

roji (Member) commented Feb 1, 2022

Databases usually provide various mechanisms for efficient bulk import of data from the client (e.g. .NET application) into the database. We already have issues tracking improving INSERT performance (#15059, #9118, #10443), but bulk import is a specialized scenario where the native database mechanisms work more efficiently (and often considerably so) than multiple INSERTs. Note also that this is different from bulk copy of data across tables within the database (INSERT ... SELECT, tracked by #27320).

Bulk import only allows targeting a table, so a simple method on DbSet should suffice (LINQ operators don't make sense here):

ctx.Blogs.ImportFrom(blogs);

ImportFrom should accept an IEnumerable parameter; ImportFromAsync should have overloads accepting both IEnumerable and IAsyncEnumerable parameters. The method would pull entity instances and import them into the database table via the provider-specific mechanism.
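
A rough sketch of what that surface could look like, written as extension methods (illustrative only - the exact signatures and where the methods live are assumptions, not a committed design):

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Hypothetical API shape for the proposal above; nothing here exists today.
public static class BulkImportExtensionsSketch
{
    // Pulls entity instances from the enumerable and imports them into the mapped
    // table via the provider's native bulk import mechanism (or a regular-INSERT fallback).
    public static void ImportFrom<TEntity>(
        this DbSet<TEntity> set, IEnumerable<TEntity> source)
        where TEntity : class
        => throw new NotImplementedException("Sketch only");

    public static Task ImportFromAsync<TEntity>(
        this DbSet<TEntity> set, IEnumerable<TEntity> source,
        CancellationToken cancellationToken = default)
        where TEntity : class
        => throw new NotImplementedException("Sketch only");

    public static Task ImportFromAsync<TEntity>(
        this DbSet<TEntity> set, IAsyncEnumerable<TEntity> source,
        CancellationToken cancellationToken = default)
        where TEntity : class
        => throw new NotImplementedException("Sketch only");
}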

Additional notes

  • EF Core can provide a default implementation that just uses regular INSERTs (without tracking, #9118, and possibly with other optimizations). This would make the API always work, regardless of whether a provider has supplied a special implementation for it.
  • This method would not track the imported instances. Tracking typically isn't needed when bulk-importing, and the optimized import mechanisms typically don't allow reading back generated values (i.e. the IDs).
  • Most databases also provide bulk import mechanisms which take CSV or other textual data. While this may be used under the hood (implementation detail), we wouldn't expose a user-facing API that deals with textual data; the focus is on an API where the user provides entity instances. An EF Core API accepting textual data would have no added value beyond just using the database's API directly.
    • Similarly, some implementations allow importing data from a file on the database server - we again wouldn't expose an API for this as EF Core has no added value here.
  • We may want to provide some sort of hook allowing users to pass provider-specific parameterization of the import process (e.g. SqlBulkCopy allows controlling the batch size, the timeout, etc.). For example, the SQL Server provider could expose an additional overload accepting a custom SqlServerBulkCopyOptions, and there would be a way for that overload to pipe the user-provided options down to the provider's implementation (see the sketch after this list).
  • Naming-wise, the word "import" is used because "copy" seems more ambiguous with copying across tables in the database (#27320).
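
To make the provider-specific hook concrete, a hypothetical SQL Server-flavored options type might look like this (SqlServerBulkCopyOptions is the made-up type from the example above; none of this exists today):

// Hypothetical provider-specific options type (illustrative only).
public class SqlServerBulkCopyOptions
{
    public int BatchSize { get; set; } = 10_000;  // would map to SqlBulkCopy.BatchSize
    public int TimeoutSeconds { get; set; } = 30; // would map to SqlBulkCopy.BulkCopyTimeout
}

// Usage sketch: the provider-specific overload pipes the options down to the
// SQL Server provider's bulk import implementation.
// ctx.Blogs.ImportFrom(blogs, new SqlServerBulkCopyOptions { BatchSize = 50_000 });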

Database support

  • SQL Server - SqlBulkCopy takes DataTable or IDataReader as input (see the SqlBulkCopy sketch after this list).
  • PostgreSQL (Npgsql) - specialized API for binary import.
  • MySql
    • MySqlConnector - similar to SQL Server's SqlBulkCopy (but seems to support only DataTable/IEnumerable, no IDataReader).
    • Official driver (seems to be CSV/tab-delimited only).
  • SQLite: apparently no special bulk import API, but there are well-known tips for making inserts faster (e.g. batching them in a single transaction). It may make sense to wrap those in a bulk import implementation.
  • Cosmos - this is a good reason to have this abstraction in core rather than in relational.
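
For reference, a minimal sketch of what the SQL Server implementation would wrap (Microsoft.Data.SqlClient's SqlBulkCopy; table and column names are made up, and connectionString is assumed to be defined elsewhere):

using System.Data;
using Microsoft.Data.SqlClient;

// Build a DataTable matching the shape of the target table.
var table = new DataTable();
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Url", typeof(string));
table.Rows.Add("Blog1", "http://example.com/1");

using var connection = new SqlConnection(connectionString); // assumed connection string
connection.Open();

using var bulkCopy = new SqlBulkCopy(connection)
{
    DestinationTableName = "dbo.Blogs",
    BatchSize = 10_000
};
bulkCopy.WriteToServer(table); // overloads taking IDataReader also exist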

Community implementations

AndriySvyryd (Member) commented

We might need an overload with anonymous types for shadow properties.

Note that this would also allow inserts for keyless entity types. So we might need a non-collection overload too.
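
For illustration, an anonymous-type overload could be used roughly like this to supply a shadow property value (entirely hypothetical, in the spirit of the proposal above):

// Hypothetical: the shadow property "TenantId" is supplied via an anonymous type.
ctx.Blogs.ImportFrom(blogs.Select(b => new { b.Name, b.Url, TenantId = tenantId }));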

roji (Member, Author) commented Feb 2, 2022

> We might need an overload with anonymous types for shadow properties.

Yep..

> Note that this would also allow inserts for keyless entity types. So we might need a non-collection overload too.

Can you elaborate?

AndriySvyryd (Member) commented

> Can you elaborate?

Just some sugar:

ctx.KeylessBlogs.ImportFrom(new KeylessBlog());

roji (Member, Author) commented Feb 2, 2022

Ah I see - you're saying this would be the only way to insert keyless entity types (because we don't support them in Add/SaveChanges), so it makes sense to allow "bulk import" of a single instance... Right.

roji (Member, Author) commented Feb 20, 2022

Note: we have #9118 for optionally not tracking after SaveChanges, which would unlock using SqlBulkCopy for standard SaveChanges. At that point, the advantage of a dedicated bulk import API (this issue) becomes bypassing the change tracking machinery, which may or may not be worth it.

yzorg commented Jun 24, 2022

I vote for taking a very close look at EFCore.BulkExtensions - one of the most underrated .NET OSS libraries I use.

It already respects EF Core table name mappings and column/property mappings. It supports all the "batch" scenarios: "upsert", insert only (only new rows), and "delete if missing" (MERGE with a DELETE clause). I've needed all of those scenarios. All of them are much faster than current EF Core for 10k+ rows (e.g. importing a full dataset daily/monthly/etc.). It supports pulling back new identity values into entities, but that's off by default, which I think is sensible for bulk scenarios. But if you can reuse roji's RETURNING work here, it might speed that up.
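
For readers who haven't used it, typical EFCore.BulkExtensions usage looks roughly like this (written from memory - double-check method names and options against the library's docs):

using EFCore.BulkExtensions;

// Insert only (new rows):
await ctx.BulkInsertAsync(blogs);

// Upsert (insert or update by key):
await ctx.BulkInsertOrUpdateAsync(blogs);

// Full sync, i.e. "delete if missing" (MERGE with a DELETE clause):
await ctx.BulkInsertOrUpdateOrDeleteAsync(blogs);

// Pulling back generated identity values into the entities is opt-in:
await ctx.BulkInsertAsync(blogs, new BulkConfig { SetOutputIdentity = true });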

roji (Member, Author) commented Jun 26, 2022

@yzorg thanks, yeah - we're aware of EFCore.BulkExtensions.

The plan in this specific issue is only to wrap the database's native bulk import mechanism - SqlBulkCopy for SQL Server, binary COPY for PostgreSQL, etc. These lower-level APIs generally allow simply copying large quantities of data into a table in the most efficient manner possible.

Various "merge" functionality such as upserting, deleting if missing, etc. are higher-level functionality which generally isn't covered by a database's bulk import mechanism, and are implemented by using additional database functionality. For example, it's typical to first use bulk import to efficiently copy data into a temporary table (e.g. via SqlBulkCopy), and then use MERGE to perform upsert/delete-if-missing/whatever between the temp table and the target table.

Such "compound" operations would be out of scope for this specific issue, but it would provide the first building block (bulk import). You could then use raw SQL to express whatever MERGE operation you want (we could possibly even provide an EF API for that, though I'm not sure it would be very valuable).

khteh commented Feb 17, 2023

https://www.npgsql.org/doc/copy.html#binary-copy is rather crude. I need to ingest tens of thousands of records from an Excel sheet into the DB without any duplicates.

roji (Member, Author) commented Feb 18, 2023

@khteh in what way is Npgsql's binary copy support crude? Deduplicating isn't a concern of bulk import in itself - that's something that needs to be done at a higher level (and therefore wouldn't be covered here in any case). More details on exactly what you need could help clarify things.
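
For example, one common way to get deduplication on top of binary COPY is to copy into a staging table and let the database resolve duplicates afterwards - a rough sketch (table/column names are made up, connectionString and excelRows are assumed, and the final INSERT assumes a unique constraint on blogs(url)):

using Npgsql;
using NpgsqlTypes;

await using var conn = new NpgsqlConnection(connectionString); // assumed connection string
await conn.OpenAsync();

// 1. Binary COPY into a staging table - the fastest way to get the rows in.
await using (var importer = conn.BeginBinaryImport(
    "COPY staging_blogs (name, url) FROM STDIN (FORMAT BINARY)"))
{
    foreach (var row in excelRows) // assumed: rows already read from the Excel sheet
    {
        importer.StartRow();
        importer.Write(row.Name, NpgsqlDbType.Text);
        importer.Write(row.Url, NpgsqlDbType.Text);
    }
    importer.Complete();
}

// 2. Deduplicate at a higher level: move rows into the real table, skipping duplicates.
await using (var cmd = new NpgsqlCommand(
    "INSERT INTO blogs (name, url) SELECT name, url FROM staging_blogs ON CONFLICT (url) DO NOTHING;",
    conn))
{
    await cmd.ExecuteNonQueryAsync();
}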
