Production Data to Power Homepage Visualization #6238
Comments
@scolapasta and I met to review the format discussed in IQSS/dataverse-sample-data#8 and he's going to help me with the SQL (phew!). This is the main visual we were looking at: This is what I had so far, based mostly on work from Gustavo's previous (longer) script.
On my list is to ask @TaniaSchlatter if "publication date" is supposed to be for the file or the dataset. I'm pretty sure it's supposed to be for the file. |
@scolapasta cooked up some good stuff for me fast so I'm back to my hacking! Thanks!! |
@sekmiller helped me a ton too. Thanks! 🎉 I just emailed the following to Jess: "Subject: 50,000 files, 329,267 files, no subjects Attached please find a zip called files.zip that contains two files:
Please note that unlike what we talked about, subjects are not yet included. I'll work on this next but I thought you might like playing around with the three levels of hierarchy, which are included."
I don't think I'll attach the files here because it's production data from Harvard Dataverse. Well, it's about a week old since we refresh our local copy on Sundays. I think I'm correctly including only files from datasets that are published, but some code review might be nice before we make this data available publicly.
Also, I always forget the syntax for using psql to create a TSV file so here it is so I have it handy next week when I start working on subjects:
|
@sekmiller helped me add subjects to the query in e8686c8 and it works fine on my laptop with a small database but it's taking way longer to run in the "copy of production" database. Before I made this change the query was taking a minute or two but now it's still going. I'm heading home for the day but I guess I'll check the query in the morning. I guess I'll attach here the results from my small local database so @TaniaSchlatter or others (Jess) can take a look: devdb.tsv.txt . I'm using a semicolon as a delimiter when there are multiple subjects. |
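For anyone consuming a file like devdb.tsv.txt, here is a minimal sketch of reading a tab-separated export in which multiple subjects share one field, separated by semicolons. The column names (`file_name`, `subjects`) and the sample rows are assumptions made up for illustration; they are not taken from the actual export.

```python
import csv
import io

# Hypothetical sample in the shape described above: tab-separated columns,
# with multiple subjects joined by semicolons in a single field.
# Column names and values are invented for illustration.
SAMPLE_TSV = (
    "file_name\tsubjects\n"
    "survey.tab\tSocial Sciences;Medicine, Health and Life Sciences\n"
    "readme.txt\tPhysics\n"
    "notes.txt\t\n"
)


def read_subjects(tsv_text):
    """Return {file_name: [subject, ...]} from a semicolon-delimited column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        raw = row["subjects"] or ""
        # Split on the semicolon delimiter and drop empty entries.
        result[row["file_name"]] = [s.strip() for s in raw.split(";") if s.strip()]
    return result


subjects_by_file = read_subjects(SAMPLE_TSV)
print(subjects_by_file)
```

A semicolon is a reasonable delimiter here because some subject names contain commas, so splitting on commas would break those names apart.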
I used 4b56213 to create "2018.zip" which I sent to @TaniaSchlatter and Jess. Please let me know if you're happy with the data. |
I think I wrote too quickly. I see 3 date columns in the file: File Creation, File Publication, Dataset Publication. I see some files with dates published before the dataset date published. I see a lot of files with a publication date of 2019. I'd like to review the dates in more detail with someone from the team. |
(see above, I take this back :)) Closing, as data looks good (great even) and has been sent to Jess. We will create another issue for the next iteration. |
As we discussed we will add a filter to only select those files that were originally published in 2018. Additionally we will verify the dataset publish date to make sure it makes sense compared to the file publish date. |
By the way, the dataset publication date we are selecting is actually the publication date of the latest version, which explains why in some (many?) cases the date is after the file publication date (which is the original file publication date). |
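The two checks discussed here (keep only files originally published in 2018, then compare each file's publication date with its dataset's publication date) could be sketched roughly as follows. This is not the SQL used for the export; the field names (`file_pub`, `dataset_pub`) and the rows are hypothetical, and the sketch assumes the dataset date is the latest version's date, so it should normally fall on or after the file's original publication date.

```python
from datetime import date

# Hypothetical rows; field names and values are invented for illustration.
ROWS = [
    {"file": "a.tab", "file_pub": date(2018, 3, 1), "dataset_pub": date(2019, 5, 2)},
    {"file": "b.tab", "file_pub": date(2017, 11, 9), "dataset_pub": date(2018, 1, 4)},
    {"file": "c.tab", "file_pub": date(2018, 7, 15), "dataset_pub": date(2018, 6, 1)},
]


def published_in(rows, year):
    """Keep only rows whose file was originally published in the given year."""
    return [r for r in rows if r["file_pub"].year == year]


def dataset_before_file(rows):
    """Rows whose dataset date (latest version) precedes the file's date.

    Under the assumption stated above these rows are unexpected and
    worth a manual review.
    """
    return [r for r in rows if r["dataset_pub"] < r["file_pub"]]


files_2018 = published_in(ROWS, 2018)
flagged = dataset_before_file(files_2018)

print([r["file"] for r in files_2018])
print([r["file"] for r in flagged])
```

In this sample, a.tab shows the expected pattern (dataset date after file date), while c.tab is the kind of row that would need a closer look.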
In fce1289 I implemented the new requirements (with much help, as always, from @sekmiller) and sent a new file to @TaniaSchlatter and Jess. |
This looks great! Thank you! |
In #5603 we're adding a visualization to the Dataverse home page. In IQSS/dataverse-sample-data#8 (comment) we settled on the format and Jess from Harvard Library requested MORE DATA. Let's provide the sample data to her, in the structure defined in #8.