This repository has been archived by the owner on Sep 1, 2022. It is now read-only.

Convert meet data to incremental additions #112

Merged
@dchess merged 4 commits into main from incremental_meet_data on Mar 18, 2021

Conversation

@zkagin (Collaborator) commented Oct 21, 2020

Fixes #100

Converts the Meet pull so that it only pulls the latest data.

@zkagin requested a review from dchess on Oct 21, 2020
@zkagin mentioned this pull request on Oct 21, 2020
@denglender (Contributor)

Ran this PR + #99 with --meet. On the first run, I got this error:

Loading .env environment variables.
2020-10-26 11:12:39PM Eastern Daylight Time | INFO: Meet: Generating requests...

2020-10-26 11:12:58PM Eastern Daylight Time | ERROR: (pyodbc.Error) ('40001', '[40001] [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Transaction (Process ID 82) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. (1205) (SQLExecDirectW)')
[SQL: SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
FROM [INFORMATION_SCHEMA].[TABLES]
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]]
[parameters: ('custom', 'BASE TABLE')]
(Background on this error at: http://sqlalche.me/e/13/dbapi)
Traceback (most recent call last):
  File "C:\Users\SZ\.virtualenvs\google_classroom-a-hnCXe9\lib\site-packages\sqlalchemy\engine\base.py", line 1277, in _execute_context
    cursor, statement, parameters, context
  File "C:\Users\SZ\.virtualenvs\google_classroom-a-hnCXe9\lib\site-packages\sqlalchemy\engine\default.py", line 593, in do_execute
    cursor.execute(statement, parameters)
pyodbc.Error: ('40001', '[40001] [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Transaction (Process ID 82) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. (1205) (SQLExecDirectW)')

On the second run, I got:

Loading .env environment variables.
2020-10-26 11:14:18PM Eastern Daylight Time | INFO: Meet: Generating requests...

2020-10-26 11:21:39PM Eastern Daylight Time | ERROR: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Traceback (most recent call last):
  File "main.py", line 173, in <module>
    main(Config)
  File "main.py", line 69, in main
    pull_data(config, creds, sql)
  File "main.py", line 162, in pull_data
    Meet(admin_reports_service, sql, config).batch_pull_data(overwrite=False)
  File "C:\Users\SZ\gc_new\google_classroom\google_classroom\timer.py", line 22, in wrapper
    results = func(*args, **kwargs)
  File "C:\Users\SZ\gc_new\google_classroom\google_classroom\endpoints\base.py", line 268, in batch_pull_data
    request_tuple = self._generate_request_tuple(course_id, date, None, 0)
  File "C:\Users\SZ\gc_new\google_classroom\google_classroom\endpoints\base.py", line 172, in _generate_request_tuple
    self.request_data(course_id, date, next_page_token),
  File "C:\Users\SZ\gc_new\google_classroom\google_classroom\endpoints\meet.py", line 36, in request_data
    if data:
  File "C:\Users\SZ\.virtualenvs\google_classroom-a-hnCXe9\lib\site-packages\pandas\core\generic.py", line 1330, in __nonzero__
    f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

@zkagin (Collaborator, Author) commented Oct 27, 2020

@denglender Sorry about that. The first error seems to be something transient that I've heard about but have never been able to repro. If it comes up frequently, let me know. The second error was a start condition I hadn't checked for. It should be fixed now.
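
For reference, the failure mode in the second traceback is pandas refusing to coerce a DataFrame to a single boolean at the bare "if data:" check in endpoints/meet.py. A minimal sketch of the kind of guard that avoids it (the helper name and structure here are illustrative, not the actual code in this PR):

    import pandas as pd

    # A bare `if data:` raises ValueError when `data` is a DataFrame, because
    # pandas will not collapse a whole frame to one truth value.
    def has_rows(data):
        # Treat the uninitialized start condition (None) and empty frames alike.
        if data is None:
            return False
        if isinstance(data, pd.DataFrame):
            return not data.empty
        return bool(data)

    print(has_rows(None))                         # False
    print(has_rows(pd.DataFrame()))               # False
    print(has_rows(pd.DataFrame({"a": [1, 2]})))  # True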

Just to clarify, you do not need #99 as well — just this one on top of the main branch should do.

@denglender (Contributor)

Ran with this update and no errors were reported, but it doesn't appear to be functioning as expected.

I ran at 10:46 AM, 10:55 AM, and 1:31 PM (after having run a full sync with the old client yesterday).

Looking at the data in the DB, there's a discontinuity in the results. Here's a snapshot of:

SELECT *
  FROM [Cust38116].[custom].[GoogleClassroom_Meet]
  WHERE item_time >= {d'2020-10-27'}
  ORDER BY item_time DESC

[screenshot: query results showing the gap in item_time]
You can see the synced records jump from 14:54 to 16:10 (these times are UTC; all other times listed in this post are UTC-4).

Comparing this to the export from the Audit Log UI, there are 3078 additional records that occurred between these two timestamps.

Here are the outputs from the runs:

Loading .env environment variables.
2020-10-27 10:46:59AM Eastern Daylight Time | INFO: Meet: Generating requests...

2020-10-27 10:48:17AM Eastern Daylight Time | INFO: Meet: 1 requests remaining. On page 0.
2020-10-27 10:48:17AM Eastern Daylight Time | DEBUG: Making request: POST https://oauth2.googleapis.com/token
2020-10-27 10:48:21AM Eastern Daylight Time | DEBUG: Meet: Queueing next page.
2020-10-27 10:49:39AM Eastern Daylight Time | DEBUG: Meet: received 1000 records, page 0.
2020-10-27 10:49:39AM Eastern Daylight Time | DEBUG: Meet: processing 1000 records.
2020-10-27 10:49:39AM Eastern Daylight Time | DEBUG: Meet: inserting 1000 records into GoogleClassroom_Meet.
2020-10-27 10:49:41AM Eastern Daylight Time | INFO: Meet: 1 requests remaining. On page 1.
2020-10-27 10:49:45AM Eastern Daylight Time | DEBUG: Meet: Queueing next page.
2020-10-27 10:51:17AM Eastern Daylight Time | DEBUG: Meet: received 1000 records, page 1.
2020-10-27 10:51:17AM Eastern Daylight Time | DEBUG: Meet: processing 1000 records.
2020-10-27 10:51:17AM Eastern Daylight Time | DEBUG: Meet: inserting 1000 records into GoogleClassroom_Meet.
2020-10-27 10:51:18AM Eastern Daylight Time | INFO: Meet: 1 requests remaining. On page 2.
2020-10-27 10:51:18AM Eastern Daylight Time | DEBUG: Meet: received 0 records, page 2.
2020-10-27 10:51:18AM Eastern Daylight Time | INFO: EndPoint.batch_pull_data completed in 4.34 minutes.
Loading .env environment variables.
2020-10-27 10:55:17AM Eastern Daylight Time | INFO: Meet: Generating requests...

2020-10-27 10:57:10AM Eastern Daylight Time | INFO: Meet: 1 requests remaining. On page 0.
2020-10-27 10:57:10AM Eastern Daylight Time | DEBUG: Making request: POST https://oauth2.googleapis.com/token
2020-10-27 10:57:12AM Eastern Daylight Time | DEBUG: Meet: received 263 records, page 0.
2020-10-27 10:57:12AM Eastern Daylight Time | DEBUG: Meet: processing 263 records.
2020-10-27 10:57:12AM Eastern Daylight Time | DEBUG: Meet: inserting 263 records into GoogleClassroom_Meet.
2020-10-27 10:57:13AM Eastern Daylight Time | INFO: EndPoint.batch_pull_data completed in 1.95 minutes.
Loading .env environment variables.
2020-10-27 01:31:55PM Eastern Daylight Time | INFO: Meet: Generating requests...

2020-10-27 01:33:25PM Eastern Daylight Time | INFO: Meet: 1 requests remaining. On page 0.
2020-10-27 01:33:26PM Eastern Daylight Time | DEBUG: Making request: POST https://oauth2.googleapis.com/token
2020-10-27 01:33:30PM Eastern Daylight Time | DEBUG: Meet: Queueing next page.
2020-10-27 01:35:14PM Eastern Daylight Time | DEBUG: Meet: received 1000 records, page 0.
2020-10-27 01:35:14PM Eastern Daylight Time | DEBUG: Meet: processing 1000 records.
2020-10-27 01:35:14PM Eastern Daylight Time | DEBUG: Meet: inserting 1000 records into GoogleClassroom_Meet.
2020-10-27 01:35:15PM Eastern Daylight Time | INFO: Meet: 1 requests remaining. On page 1.
2020-10-27 01:35:19PM Eastern Daylight Time | DEBUG: Meet: Queueing next page.
2020-10-27 01:37:12PM Eastern Daylight Time | DEBUG: Meet: received 1000 records, page 1.
2020-10-27 01:37:12PM Eastern Daylight Time | DEBUG: Meet: processing 1000 records.
2020-10-27 01:37:13PM Eastern Daylight Time | DEBUG: Meet: inserting 1000 records into GoogleClassroom_Meet.
2020-10-27 01:37:14PM Eastern Daylight Time | INFO: Meet: 1 requests remaining. On page 2.
2020-10-27 01:37:14PM Eastern Daylight Time | DEBUG: Meet: received 0 records, page 2.
2020-10-27 01:37:14PM Eastern Daylight Time | INFO: EndPoint.batch_pull_data completed in 5.33 minutes.

@zkagin force-pushed the incremental_meet_data branch 2 times, most recently from bba9cfb to 86d011a on Oct 28, 2020
@zkagin (Collaborator, Author) commented Oct 28, 2020

Ahh, it looks like on subsequent pages, the changed startTime causes the request to return nothing. I just tested storing the startTime parameter from the first request and reusing it, and this seems to work better (I pulled all the data, deleted the last few days, and then pulled again).
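
In other words, the request window has to be frozen before page 0 rather than recomputed per page. A rough sketch of that idea (fetch_page and the loop shape are hypothetical stand-ins for the real request plumbing in endpoints/base.py):

    # Bug: recomputing start_time on every page meant later pages asked for
    # "records since now" and got nothing back.
    # Fix: compute start_time once, then reuse it for every page token.
    def pull_all_pages(fetch_page, last_synced_time):
        start_time = last_synced_time  # frozen once, not per page
        records, token = [], None
        while True:
            page, token = fetch_page(start_time=start_time, page_token=token)
            records.extend(page)
            if token is None:
                return records

    # Example with a stub that serves two pages then stops:
    pages = {None: ([1, 2], "t1"), "t1": ([3], None)}
    fetch = lambda start_time, page_token: pages[page_token]
    print(pull_all_pages(fetch, "2020-10-27T00:00:00Z"))  # [1, 2, 3]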

Thanks for your patience @denglender. Hoping we have a working solution now.

@denglender (Contributor)

Sorry for the delay on feedback on this: it still appears to be missing data after a few runs, but I haven't yet been able to identify exactly what's missing. If I run the incremental sync a few times over the course of a few hours, pull the data, and then run the full sync (i.e. without the patch), the data doesn't exactly match.

@zkagin (Collaborator, Author) commented Nov 16, 2020

Hi @denglender, let me know if you can determine what data might be missing. I've tested by running a fresh pull, deleting some data, and then pulling again to make sure it ends up with the same data, which should have the same effect, but perhaps there's a timezone issue cropping up somewhere? I'll keep playing around with this to see if I can identify the issue you're having as well.

@denglender (Contributor)

I don't think it's a time zone issue. It looks like there's just some data near the edge of the refresh that isn't getting pulled in. As an example, I ran a full refresh last night, then incremental refreshes today at 11:08 AM, 12:30 PM, and 1:26 PM. I pulled a copy of the whole DB table for today, then re-ran the sync using the full-sync code (i.e. what's currently live in the repo), pulled a copy of the whole table again, and compared.

After eliminating the data from after 1:26 PM, the DB from the incremental syncs has 11262 rows, while the DB from the full sync has 11409 rows. The missing rows are scattered mostly around the 11:08, 12:30, and 1:26 time periods. I've shown them in the attached screenshot (yellow rows are rows that are in the full sync but not the incremental one; data is sorted chronologically with the latest at top). For each "grouping" of missing/yellow rows, here are the ranges that all of them fall within: 1:06 PM - 1:13 PM, 12:04 PM - 12:18 PM, and 10:42 AM - 10:56 AM. Not all the rows in those time frames are missing from the incremental version, but no rows are missing outside of those ranges.

So it kind of looks like there's an approx. 15-minute period where the data reported by the API is incomplete, plus an additional approx. 15-minute delay in reporting it. I.e., at the 1:26 pull, there was no data reported after 1:13 (~15 min), and the data after 1:06 was incomplete (another ~10 minutes in this case, but ~15 min in some of the others). I'm not sure how reliable those time gaps are. If they're moderately reliable, then perhaps looking back a little further (e.g. an hour) prior to the most recent DB entry whenever the incremental sync is run could do the trick.

[screenshot: matched_rows, full-sync vs. incremental comparison with missing rows highlighted in yellow]

@zkagin (Collaborator, Author) commented Dec 9, 2020

@denglender Makes sense, and understandable that Google wouldn't have up-to-the-second data available in the API. The safest approach is probably to drop the last day and start from there, as we don't want to rely on how quickly Google processes data and makes it available. I'll take a look at making those changes in the next few days and update this PR with it.

@zkagin (Collaborator, Author) commented Dec 14, 2020

@denglender Thanks for your patience. The latest version here should hopefully work. It gets the last date and drops the prior 24 hours, so that it can replace any outdated data and pick up records that might've been missing from the last pull. Give it a try and see if this is complete now!
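
For clarity, the strategy is: find the newest item_time already stored, rewind 24 hours, delete everything after that cutoff, and re-request from the cutoff so late-arriving records get picked up. A minimal sketch under those assumptions (the table and column names match the queries earlier in this thread; the helper itself is hypothetical, not the code in this PR):

    from datetime import timedelta

    import pandas as pd
    import sqlalchemy

    def incremental_cutoff(engine):
        last = pd.read_sql(
            "SELECT MAX(item_time) AS last_time FROM custom.GoogleClassroom_Meet",
            engine,
        )["last_time"].iloc[0]
        if pd.isna(last):
            return None  # empty table: fall back to a full pull
        # Rewind a full day to cover the ~15-minute reporting lag seen above.
        cutoff = last - timedelta(hours=24)
        with engine.begin() as conn:
            conn.execute(
                sqlalchemy.text(
                    "DELETE FROM custom.GoogleClassroom_Meet WHERE item_time > :cutoff"
                ),
                {"cutoff": cutoff},
            )
        return cutoff  # use as the startTime for the next API pull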

@denglender (Contributor)

This looks like it's working to me, and it solves the original issue. The data pull with the new code took 3 minutes, compared to 129 minutes for the full load.

@zkagin (Collaborator, Author) commented Jan 6, 2021

Great to hear @denglender!

@dchess, thoughts on merging this in?

@zkagin (Collaborator, Author) commented Feb 7, 2021

Rebased and ready to merge whenever you want.

@dchess closed this on Mar 18, 2021
@dchess reopened this on Mar 18, 2021
@dchess (Collaborator) commented Mar 18, 2021

@zkagin The Black formatter is showing no errors on my system, but the automated tests are failing. I think my local environment is on a different version of black (19.10b0) than the GitHub Actions runner. I'm going to force-merge this for now and then open a separate PR to bump this codebase to Python 3.8.

@dchess merged commit 9f28c16 into main on Mar 18, 2021