This repository has been archived by the owner on Sep 1, 2022. It is now read-only.

Archive StudentSubmissions before refresh #91

Open
dchess opened this issue Aug 3, 2020 · 19 comments
Labels: enhancement (New feature or request)


dchess (Collaborator) commented Aug 3, 2020

If a student is removed from a course, all of the submission data for that course is lost. We should create a snapshot of the prior day's data before truncating the table.

Possible logic (a sketch follows the list):

  • Before truncating StudentSubmission table, append data to archive table
  • Refresh StudentSubmission table
  • Remove from the archive table any assignment (matched by submission ID) that also exists in the current table
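A sketch of that flow with plain SQLAlchemy, using illustrative table and column names (StudentSubmissions, StudentSubmissionsArchive, a submission id column) rather than the project's actual helpers:

```python
# Sketch only: snapshot the table into an archive before the truncate-and-refresh,
# then drop archive rows whose submission id reappears in the refreshed table.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@host/db")  # illustrative connection string

with engine.begin() as conn:
    # 1. Copy the current snapshot into the archive table.
    conn.execute(text(
        "INSERT INTO StudentSubmissionsArchive SELECT * FROM StudentSubmissions"
    ))
    # 2. Truncate the main table; the existing sync job repopulates it afterwards.
    conn.execute(text("TRUNCATE TABLE StudentSubmissions"))

# ... existing Google Classroom pull refreshes StudentSubmissions here ...

with engine.begin() as conn:
    # 3. Remove archive rows whose submission id now exists in the current table.
    conn.execute(text(
        "DELETE FROM StudentSubmissionsArchive "
        "WHERE id IN (SELECT id FROM StudentSubmissions)"
    ))
```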

This logic would also help maintain year-over-year data. This is potentially something we should consider for other data sources as well: teachers, students, courses, coursework, etc.

dchess added the enhancement label Aug 15, 2020
dchess (Collaborator, Author) commented Aug 19, 2020

@zkagin Can we prioritize this piece of logic before moving forward more on the syncing logic?

zkagin (Collaborator) commented Aug 19, 2020

Yep. Just to confirm, the end result we want is an ever-growing list of deleted StudentSubmissions, correct? And built so that we can use this logic for any of the endpoints as needed?

dchess (Collaborator, Author) commented Aug 19, 2020

@zkagin Yeah. I think we'd want to maintain a list of data that has previously come through so the truncation of the main table doesn't result in data loss. @denglender also let me know today that the Meets endpoint only stores the last 6 months of data, so we will want it for that as well. I think putting this logic in place for all endpoints makes sense too, since even a year-to-year rollover, when the env var for the start date is changed, would impact the historic data.

What are your thoughts on a separate archive table vs. removing truncation and using a process that appends and then deletes duplicate records (keeping the most recent)? That would eliminate the need to maintain more tables.

zkagin (Collaborator) commented Aug 19, 2020

Yep, the cleanest way is probably a soft delete pattern, where there's an additional "deleted" column. The downside is that we then need to deal with schema migrations, since we are no longer dropping the table after each run. The first migration can be a simple if statement, but future ones should probably be handled more robustly with something like Alembic.

The diffing logic I sent in #98 would actually be pretty useful here. The steps would be to pull the latest data, diff it against the DB, change the relevant rows to deleted=True, and then append the new rows.
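For illustration, that flow might look roughly like the following; the actual #98 diffing helper isn't reproduced here, so this just compares on a hypothetical id column, and pull_latest_submissions, the table name, and the deleted flag are all placeholders:

```python
# Sketch only: diff the latest pull against the DB, soft-delete missing rows,
# then append the new ones. All names here are illustrative.
import pandas as pd
from sqlalchemy import bindparam, create_engine, text

engine = create_engine("postgresql://user:pass@host/db")  # illustrative

latest = pull_latest_submissions()  # hypothetical helper returning a DataFrame from the API
existing = pd.read_sql("SELECT id FROM StudentSubmissions WHERE NOT deleted", engine)

# Rows present in the DB but absent from the latest pull get soft-deleted.
removed_ids = list(set(existing["id"]) - set(latest["id"]))
if removed_ids:
    stmt = text(
        "UPDATE StudentSubmissions SET deleted = TRUE WHERE id IN :ids"
    ).bindparams(bindparam("ids", expanding=True))
    with engine.begin() as conn:
        conn.execute(stmt, {"ids": removed_ids})

# Rows in the latest pull that aren't in the DB yet get appended.
new_rows = latest[~latest["id"].isin(existing["id"])].assign(deleted=False)
new_rows.to_sql("StudentSubmissions", engine, if_exists="append", index=False)
```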

Let me know which approach you'd prefer.

zkagin (Collaborator) commented Aug 20, 2020

@dchess Just checking in on this again — let me know if you like the described approach or if you had something else in mind.

dchess (Collaborator, Author) commented Aug 21, 2020

@zkagin Thinking about this more, I'm wondering if the deleted column is necessary. Shouldn't IDs be unique? Couldn't we append, then remove duplicate IDs, preserving the row with the most recent updateTime?
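For illustration, the "keep the newest row per ID" rule is simple on the pandas side (column names hypothetical); doing it after appending to the table would instead need a SQL dedupe like the ones discussed further down:

```python
import pandas as pd

# existing_rows and new_rows are hypothetical DataFrames with id/updateTime columns.
combined = pd.concat([existing_rows, new_rows])
deduped = (
    combined.sort_values("updateTime")
            .drop_duplicates(subset="id", keep="last")  # newest row wins per submission id
)
```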

zkagin (Collaborator) commented Aug 21, 2020

Yep, if you don't need to know which student submissions are from deleted / removed students, then we shouldn't need a delete column. That might be a short-term answer to it.

I think checking for differences and then doing adds/updates/deletes to the DB is better than appending and then deduping, from both a stability and a performance perspective (minimizing DB writes in both cases). I can start building off of #98 and use that to build this for StudentSubmissions.

dchess (Collaborator, Author) commented Aug 21, 2020

@zkagin That makes sense to me (minimizing DB writes), but I'm also concerned about memory limitations from keeping the diffing in local memory. I've been hitting some memory thresholds on the coursework and submissions endpoints lately.

If the diffing could be done at the batch level and minimize the dataframes held in memory before writing the inserts, then I could see that approach working.

zkagin (Collaborator) commented Aug 21, 2020

Thinking more, you could probably infer that something is deleted by seeing how recent the updateTime is, but that feels pretty non-explicit if that information is relevant (either for StudentSubmissions or for any other endpoints we eventually want this functionality for). Either way, as a first-pass solution this should work without changing any DB schema.

zkagin (Collaborator) commented Aug 21, 2020

@dchess Are you seeing any performance issues when you run it on your own machine, or is the production environment more limited? I can try and simulate it on my own machine to make sure memory isn't an issue.

dchess (Collaborator, Author) commented Aug 21, 2020

@zkagin I'm not sure it's a delete in all cases. In some it's more about preserving previously pulled data (i.e., when the start date changes).

Memory has been an issue in production, but we are running slim VMs (2-4GB). Could be this job requires more resources.

zkagin (Collaborator) commented Aug 21, 2020

@dchess Makes sense then; let's do it without a delete, and we can add additional information later if need be.

If memory is an issue, we will need to handle it for writing back to GC anyway, so we can batch by ID in a pretty straightforward manner if need be.

zkagin (Collaborator) commented Aug 21, 2020

@dchess I did some playing around with this, and it looks like pandas only really likes to write/append to SQL, not delete or update. So we have a few options with some tradeoffs:

  • Loop through individual items to do updates or deletes from pandas data (this might have some performance implications if there are a ton of items to update, but may not be bad at all if it is run more frequently).
  • Follow what you mentioned earlier by appending all the new data, and then execute a SQL command to do the deduping and deleting (sketched after this list). I usually try to avoid writing raw SQL in code, but this might be a case that warrants it.
  • Hold all of the incoming data in memory, merge it with the existing table, drop the table, and then rewrite all of it together. This might have memory issues if it has to hold all submissions data in memory.
  • Some combination of 2 and 3, where you append all of the new data, then read it out, merge it, dedupe it, drop the table, and rewrite it all together. This would have the same memory issues as the 3rd option.
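For example, the second option could come down to a single statement run right after the append; this sketch assumes Postgres-style SQL, illustrative column names, and an existing SQLAlchemy engine from the loader:

```python
# Sketch only: after appending the fresh pull, delete every row that has a newer
# duplicate with the same submission id.
from sqlalchemy import text

DEDUPE_SQL = text("""
    DELETE FROM StudentSubmissions s
    USING StudentSubmissions newer
    WHERE s.id = newer.id
      AND s."updateTime" < newer."updateTime"
""")

with engine.begin() as conn:  # same engine/connection the loader already uses
    conn.execute(DEDUPE_SQL)
```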

What do you think?

zkagin (Collaborator) commented Aug 21, 2020

@dchess One other idea to build on the 1st/2nd would be to write the updated/new data to a temp table, then write a shorter SQL query to upsert the data into the existing table.
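A sketch of that staging-table upsert, assuming Postgres, a unique key on id, an existing SQLAlchemy engine, and a latest DataFrame holding the pull (all names illustrative):

```python
# Sketch only: stage the pull with to_sql, then upsert it into the main table.
from sqlalchemy import text

latest.to_sql("studentsubmissions_staging", engine, if_exists="replace", index=False)

UPSERT_SQL = text("""
    INSERT INTO StudentSubmissions
    SELECT * FROM studentsubmissions_staging
    ON CONFLICT (id) DO UPDATE
        SET "updateTime" = EXCLUDED."updateTime"  -- plus whichever other columns should refresh
""")

with engine.begin() as conn:
    conn.execute(UPSERT_SQL)
    conn.execute(text("DROP TABLE studentsubmissions_staging"))
```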

dchess (Collaborator, Author) commented Aug 21, 2020

@zkagin Check the docs for sqlsorcery; I have some examples of how to do updates/deletes by dipping into SQLAlchemy functions.

zkagin (Collaborator) commented Aug 21, 2020

@dchess I looked around and saw options for dropping/truncating a table and for returning a SQLAlchemy table object to run commands, but couldn't find anything for doing batch updates/deletes from a dataframe. Could you point me to what you were thinking of?

zkagin (Collaborator) commented Aug 21, 2020

Ahh yes. Unfortunately I think we'd still have the problem that there's no way to summarize which rows need to be updated? If it is an arbitrary set of rows, it seems we still need to iterate through each item to be updated and call it separately.

It now occurs to me though that if we can batch the execution of it, this may be totally ok. I would just want to avoid having to write and commit each one of these separately.

Do you know if there's any other way to batch update arbitrary rows from dataframes?

dchess (Collaborator, Author) commented Aug 21, 2020

@zkagin One approach would be to:

  • append new records to the table
  • query all records with duplicate IDs and their updateTime
  • delete records with the MIN updateTime whose ID is in that dataset

Another approach (sketched below):

  • API pulls GC data into df
  • Query table for any matching IDs
  • Delete those IDs from table
  • Insert all new records
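A sketch of that second approach with pandas and SQLAlchemy; the delete runs as one batched IN statement, and pull_latest_submissions, the table/column names, and the pre-existing engine are all placeholders:

```python
# Sketch only: delete the stale copies of any ID in the fresh pull, then insert
# the fresh pull, so each ID keeps only its latest version.
import pandas as pd
from sqlalchemy import bindparam, text

incoming = pull_latest_submissions()  # hypothetical helper returning a DataFrame from the API

existing_ids = set(pd.read_sql("SELECT id FROM StudentSubmissions", engine)["id"])
matching = [i for i in incoming["id"] if i in existing_ids]

if matching:
    stmt = text("DELETE FROM StudentSubmissions WHERE id IN :ids").bindparams(
        bindparam("ids", expanding=True)
    )
    with engine.begin() as conn:
        conn.execute(stmt, {"ids": matching})

incoming.to_sql("StudentSubmissions", engine, if_exists="append", index=False)
```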
