Use EF 8 openjson for inserts #34512

Closed
AndersMalmgren opened this issue Aug 22, 2024 · 5 comments

@AndersMalmgren

With EF 8 we can do really fast queries with large datasets, like:

var ids = Enumerable.Range(0, 1000).ToArray();
var foobar = await _ctx.Set<Foo>().Where(s => ids.Contains(s.Id)).ToListAsync();
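For context, on SQL Server EF Core 8 translates the Contains above into an OPENJSON lookup, roughly of this shape (a sketch only; table/column names and the parameter name are illustrative, and the exact SQL depends on the model and compatibility level):

SELECT [f].[Id] /* plus the other mapped columns */
FROM [Foo] AS [f]
WHERE [f].[Id] IN (
    SELECT [i].[value]
    FROM OPENJSON(@ids) WITH ([value] int '$') AS [i]
)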

This is super helpful for many use cases.
But couldn't this also be used to pass a lot of data for inserts? EF inserts are very slow compared to a stored procedure with table-valued parameters.

If you could pass the insert data as JSON, it would almost remove the need for stored procedures.

@roji
Copy link
Member

roji commented Aug 22, 2024

What kind of usage of OPENJSON are you thinking about for inserts? Can you provide a sample? Why do you think that would be more efficient?

Note that in any case, #27333 tracks introducing an efficient bulk insert API to EF, which would use each database driver's efficient bulk import mechanism. That's likely to be far more efficient than any JSON-based solution.
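For reference, on SQL Server the driver-level bulk import mechanism is SqlBulkCopy. A minimal sketch of what that looks like today, outside of EF (the foos collection, connectionString, and the table/column names here are illustrative, not from this issue):

using System.Data;
using Microsoft.Data.SqlClient;

// Build an in-memory table matching the destination schema and stream it to the
// server in one bulk operation, instead of issuing per-row INSERT statements.
var table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Columns.Add("Name", typeof(string));
foreach (var foo in foos)
    table.Rows.Add(foo.Id, foo.Name);

using var bulk = new SqlBulkCopy(connectionString)
{
    DestinationTableName = "Foo"
};
await bulk.WriteToServerAsync(table);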

@AndersMalmgren
Author

> What kind of usage of OPENJSON are you thinking about for inserts? Can you provide a sample? Why do you think that would be more efficient?
>
> Note that in any case, #27333 tracks introducing an efficient bulk insert API to EF, which would use each database driver's efficient bulk import mechanism. That's likely to be far more efficient than any JSON-based solution.

With a JSON parameter and an OPENJSON in the SQL, there would be a single parameter and a single statement sent to SQL Server. Today, one INSERT is sent with a separate set of parameters for each element being inserted.

@roji
Member

roji commented Aug 22, 2024

Do you have any benchmark or resource showing that to be faster?

@AndersMalmgren
Author

AndersMalmgren commented Aug 26, 2024

> Do you have any benchmark or resource showing that to be faster?

I did a quick test. Inserting 65k entities using EF took 40-60 seconds.

That code looks like:

var includedPayments = await _ctx.Set<PaymentTransaction>().Where(p => p.Balanced == false && p.PaymentState == PaymentState.PendingApproval && p.PaymentDirection == PaymentDirection.Outgoing).ToListAsync();

await _ctx.AddAsync(new PaymentTransactionSnapshot
{
    Snapshots = includedPayments.Select(p => new Snapshot { Checksum = p.RowVersion, PaymentTransaction = p }).ToList()
});

For reference, I tried moving the entire thing to a stored procedure and did an INSERT INTO ... SELECT over the same data, which took 200 ms.

I then did a quick POC like:

var includedPayments = await _ctx.Set<PaymentTransaction>().Where(p => p.Balanced == false && p.PaymentState == PaymentState.PendingApproval && p.PaymentDirection == PaymentDirection.Outgoing).ToListAsync();

var json = System.Text.Json.JsonSerializer.Serialize(includedPayments.Select(p => new { PaymentTransactionId = p.Id, p.RowVersion }));

var sql = @"INSERT INTO PaymentTransactionSnapshot (CreatedBy, CreatedUTC, Sent) VALUES ({0}, GETUTCDATE(), 0)

INSERT INTO PaymentTransactionSnapshotPaymentTransaction (Id, PaymentTransactionId, Checksum) 
SELECT SCOPE_IDENTITY(),  JSON_VALUE(value, '$.PaymentTransactionId'), CAST(JSON_VALUE(value, '$.RowVersion') as binary(8)) FROM 
OPENJSON({1}, '$')";

await _ctx.ExecuteRaw(sql, _ctx.Username, json);

Again, same data, and it took 5 seconds. So it's a lot slower than the stored procedure, since there the data is available from within the SP with a query. But 5 seconds is still really good, about 8 percent of the time the EF Core insert took. Also, a lot of the time the data can't be queried from within an SP and has to be provided by the domain.

Edit: if I optimize the populating query, it goes down to sub-second timing:

var includedPayments = await _ctx.Set<PaymentTransaction>().Where(p => p.Balanced == false && p.PaymentState == PaymentState.PendingApproval && p.PaymentDirection == PaymentDirection.Outgoing)
    .Select(p => new { p.Id, p.RowVersion}).ToListAsync();

var json = System.Text.Json.JsonSerializer.Serialize(includedPayments.Select(p => new { PaymentTransactionId = p.Id, p.RowVersion }));

var sql = @"INSERT INTO PaymentTransactionSnapshot (CreatedBy, CreatedUTC, Sent) VALUES ({0}, GETUTCDATE(), 0)

INSERT INTO PaymentTransactionSnapshotPaymentTransaction (Id, PaymentTransactionId, Checksum) 
SELECT SCOPE_IDENTITY(),  JSON_VALUE(value, '$.PaymentTransactionId'), CAST(JSON_VALUE(value, '$.RowVersion') as binary(8)) FROM 
OPENJSON({1}, '$')";

await _ctx.ExecuteRaw(sql, _ctx.Username, json);

@roji
Member

roji commented Aug 30, 2024

@AndersMalmgren I'm quite confused by the above - you're comparing different things at different levels.

First, for the EF code that takes 40-60 seconds for 65k entities, please submit an actual minimal repro, or at least a full sample. It's simply impossible to know what exactly you're measuring: whether you're calling SaveChanges() after each AddAsync, once at the end, etc.

Then, if you'd like to show that inserting something using OPENJSON is somehow faster than inserting without OPENJSON, you need to compare two comparable SQL insertions, and not higher-level EF Add/SaveChanges code with a lower-level raw SQL API. You also need to do this in a more serious way, by using BenchmarkDotNet (again, it's not clear what exactly you're measuring and how).
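A minimal BenchmarkDotNet skeleton for that kind of comparison could look like the following (a sketch only; the method bodies, connection handling, and row counts are placeholders, not code from this issue):

using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class InsertBenchmarks
{
    // Both benchmarks should insert the same rows into the same (emptied) table,
    // so the only variable is the SQL shape.
    [Benchmark(Baseline = true)]
    public Task MultiRowValuesInsert() => Task.CompletedTask; // plain INSERT ... VALUES (...), (...)

    [Benchmark]
    public Task OpenJsonInsert() => Task.CompletedTask; // INSERT ... SELECT ... FROM OPENJSON(@json)
}

// Run with: BenchmarkRunner.Run<InsertBenchmarks>();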

In any case, I'm not sure why it makes sense to pack all your data in a JSON, and then do INSERT ... SELECT FROM OPENJSON() over that data; what exactly is the advantage there compared to a simpler INSERT with multiple VALUES:

INSERT INTO PaymentTransactionSnapshotPaymentTransaction (Id, PaymentTransactionId, Checksum) VALUES (v1, v2, v3), (v4, v5, v6)

This is the traditional - and also more portable - way to do what you're doing here; you'd need to at least provide a reliable benchmark comparing the two for this to make any sense.
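For illustration, a parameterized version of that multi-row VALUES insert could be built like this (a sketch reusing the includedPayments projection from above, assuming _ctx is or wraps a DbContext; the column list is trimmed, and very large row sets would need batching to stay under SQL Server's parameter limit):

// Build "({0}, {1}), ({2}, {3}), ..." placeholders and the matching parameter values.
var parameters = new List<object>();
var rows = new List<string>();
foreach (var p in includedPayments)
{
    var i = parameters.Count;
    rows.Add($"({{{i}}}, {{{i + 1}}})");
    parameters.Add(p.Id);
    parameters.Add(p.RowVersion);
}

var sql = "INSERT INTO PaymentTransactionSnapshotPaymentTransaction (PaymentTransactionId, Checksum) VALUES "
        + string.Join(", ", rows);

await _ctx.Database.ExecuteSqlRawAsync(sql, parameters.ToArray());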

@AndriySvyryd closed this as not planned on Sep 14, 2024