Provide API and internal framework for modifying variable-level (tabular) metadata. #4448

Closed
landreev opened this issue Feb 1, 2018 · 12 comments · Fixed by #5971

@landreev
Contributor

landreev commented Feb 1, 2018

This is the first concrete dev. issue being opened as part of work on #4174; based on the discussion with the Scholars Portal team (see the notes in that issue).
The plan is to create an (external) tool that will allow the data owner to modify data variable metadata, for example to add weights to variables, change variable types, or add categorical labels, in order to make the metadata more descriptive and/or accurate.

On the Dataverse end, we need to provide an API to accept the updated version of the metadata (in DDI/xml format) and save it in the database permanently.

Creating this API endpoint would be trivial; so is parsing variable-level DDI (we already have legacy code for this that we inherited from the DVN project). The main technical challenge is that, as of now, our tabular (data variable) metadata are immutable. The DataVariable objects are created during tabular ingest; they are linked directly to DataFile objects (bypassing the versionable FileMetadata hierarchy), and no mechanism is provided for modifying the information in these objects.
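
To make the "parsing variable-level DDI" step concrete, here is a minimal sketch in Python. It is not the legacy DVN parser mentioned above. The sample fragment is hypothetical and namespace-free (real DDI exports normally carry an XML namespace), and the function name `parse_variables` and the output dictionary layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified DDI fragment (no XML namespace), for illustration only.
SAMPLE_DDI = """
<codeBook>
  <dataDscr>
    <var ID="v1" name="sex">
      <labl>Sex of respondent</labl>
      <catgry><catValu>1</catValu><labl>Male</labl></catgry>
      <catgry><catValu>2</catValu><labl>Female</labl></catgry>
    </var>
    <var ID="v2" name="age">
      <labl>Age in years</labl>
    </var>
  </dataDscr>
</codeBook>
"""


def parse_variables(ddi_xml):
    """Return {variable ID: {"name", "label", "categories"}} from a DDI string."""
    root = ET.fromstring(ddi_xml.strip())
    variables = {}
    for var in root.iter("var"):
        # Map each category value to its label; empty if the variable has none.
        categories = {
            cat.findtext("catValu", default=""): cat.findtext("labl", default="")
            for cat in var.findall("catgry")
        }
        variables[var.get("ID")] = {
            "name": var.get("name"),
            # findtext("labl") only looks at direct children, so it picks the
            # variable label rather than a category label.
            "label": var.findtext("labl", default=""),
            "categories": categories,
        }
    return variables
```

Calling `parse_variables(SAMPLE_DDI)` yields one entry per `var` element, keyed by its `ID` attribute.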

So the first task is to make the DataVariable metadata versionable. We may need to invest some thought into how we want to achieve this. A trivial solution would be to simply replicate what we are doing with all the other metadata: link the DataTable->DataVariable hierarchy to FileMetadata (instead of DataFile), thus allowing multiple versions of DataVariables to be associated with the same DataFile, the same way the same DataFile can have different file names in different versions.
HOWEVER, this system is fairly wasteful by design. If you have a dataset and you modify a single metadata field (fix a typo in the title, for example), we always create a new DatasetVersion that duplicates ALL THE METADATA fields that existed in the previous version (not just the changed title!), including all the FileMetadata information associated with every file in the dataset. Making all the variable-level metadata subject to this automatic duplication would increase the size of the database by an order of magnitude or worse, I believe. We probably want to avoid this by designing some mechanism for creating new versions of the variable-level metadata only when there is an actual change. We should be able to reuse the same DataVariable objects between different versions until an actual change is made, and only then create a new one. This would warrant a dedicated architecture discussion.
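
One possible shape for the "reuse until an actual change" idea is sketched below in Python. This is only an illustration of the technique, not a proposal for the actual Dataverse schema: `VariableMetadata`, `VariableStore`, and their methods are hypothetical names, and a real implementation would live in the JPA entity model rather than in in-memory dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableMetadata:
    """Hypothetical immutable record of one variable's editable metadata."""
    label: str
    categories: tuple = ()  # (value, label) pairs; a tuple keeps records comparable


class VariableStore:
    """Stores a new record only when a variable's metadata actually changes."""

    def __init__(self):
        self._records = []   # every distinct record ever stored
        self._versions = {}  # version number -> {variable name: index into _records}

    def publish(self, version, variables, previous=None):
        """Register `variables` for `version`, reusing unchanged records."""
        prev_map = self._versions.get(previous, {})
        new_map = {}
        for name, meta in variables.items():
            prev_idx = prev_map.get(name)
            if prev_idx is not None and self._records[prev_idx] == meta:
                new_map[name] = prev_idx  # unchanged: point at the existing record
            else:
                self._records.append(meta)  # changed or new: store a fresh record
                new_map[name] = len(self._records) - 1
        self._versions[version] = new_map

    def get(self, version, name):
        return self._records[self._versions[version][name]]

    def shares_record(self, version_a, version_b, name):
        return self._versions[version_a][name] == self._versions[version_b][name]

    def record_count(self):
        return len(self._records)
```

With two variables where only one is edited between versions 1 and 2, the store ends up holding three records rather than four, while each version still resolves to its own view of the metadata.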

Aside from that, all very doable.

@kevinworthington
Contributor

Thank you for creating this ticket @landreev and summarizing a very valid concern. Only updating that which has changed is definitely the way to go.

The workflow I see this API supporting, when a call is made with revised metadata, is as follows: first, a comparison is made against the system copy. Then anything that has changed is saved, any new elements are added, and any removed elements are deleted.
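
The compare-then-save workflow described above could be sketched roughly as follows. This is a minimal illustration in Python, assuming both the stored and the incoming metadata have already been parsed into dictionaries keyed by variable ID; the function names `diff_variables` and `apply_diff` are hypothetical.

```python
def diff_variables(stored, incoming):
    """Compare the system copy with the revised metadata.

    Both arguments map a variable ID to that variable's metadata (any
    value that supports equality comparison, such as a dict).
    """
    added = {k: incoming[k] for k in incoming.keys() - stored.keys()}
    removed = sorted(stored.keys() - incoming.keys())
    changed = {
        k: incoming[k]
        for k in incoming.keys() & stored.keys()
        if incoming[k] != stored[k]
    }
    return {"added": added, "changed": changed, "removed": removed}


def apply_diff(stored, diff):
    """Return an updated copy of `stored`; untouched entries are left as they were."""
    updated = dict(stored)
    updated.update(diff["added"])
    updated.update(diff["changed"])
    for key in diff["removed"]:
        del updated[key]
    return updated
```

Only the entries listed under `added`, `changed`, and `removed` would need to be written, so an unchanged variable causes no database activity.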
To get a sense of the kinds of changes we are hoping to make and store, look at the metadata for the file http://odesi1.scholarsportal.info/webview/index.jsp?object=http://142.150.190.11:80%2Fobj%2FfStudy%2Farg-drones-E-2014-can&mode=documentation&v=2&top=yes
(click the disk icon, then choose "in XML format" under the "Download the documentation" section)
and compare it to the default metadata generated by Dataverse at https://dataverse.scholarsportal.info/ddi_explore/index.html?uri=https://dataverse.scholarsportal.info/api/access/datafile/8988/metadata/ddi&
(choose "variable metadata" from the "Download" dropdown).

@landreev
Contributor Author

landreev commented Feb 1, 2018

I guess I should note that there is also a possibility of making the variable metadata editable, but not versionable: keep the DataTable -> DataVariable hierarchy directly attached to the DataFile, and modify it directly, without keeping track of the changes.
I'm not sure if this is at all acceptable, but it would be easy to achieve.

@kevinworthington
Contributor

kevinworthington commented Feb 27, 2018

Hi @landreev,
I just had a chance to speak with Amber and other members of our team about the option to have the DDI updatable but not versionable. The consensus is that this would be acceptable in the short term, but in the long term it should really be versionable.

@pdurbin
Member

pdurbin commented May 25, 2018

Heads up that over at #4716 (comment) @kevinworthington just linked to a new git repo at https://github.com/scholarsportal/Dataverse-Data-Curation-Tool that he seems to be hard at work on for #4174 so we should probably build these APIs so he can use 'em when he's ready. 😄

@amberleahey

Hello! Just checking in: we have a new developer on staff at SP, Victoria Lubitch, who started a few weeks ago and will be resuming development of the DCT tool. We have an ongoing list of issues to work on, including integration with the DV workflow and code. I'll share a mock-up of the architecture diagram workflow we recently discussed for feedback / input shortly.

Can work continue on the API to write variable-level DDI back to the database? I think this was also requested for Jon @odum for another related project. Having this connected to the DDI XML stored in DV would be awesome --- already envisioning all sorts of use cases!! DV generated codebooks and data dictionaries here we come!! :)

@pdurbin
Member

pdurbin commented Sep 24, 2018

@amberleahey great news! I just moved this issue to "Community Dev" at https://waffle.io/IQSS/dataverse .

For now I'd like to assign you to this issue so I just invited you to join https://github.com/orgs/IQSS/teams/dataverse-readonly so you can be assigned. It would be great to add Victoria to that group as well so if you know her GitHub username already, please pass it along.

I believe the related project you're talking about is Trusted Remote Storage Agent (TRSA). I don't think there's an issue for this yet but pull request #4750 is how we're tracking it. /cc @jonc1438 @akio-sone

@lubitchv
Contributor

lubitchv commented Sep 24, 2018

Hi Philip. My github username is lubitchv

@pdurbin
Member

pdurbin commented Sep 24, 2018

@lubitchv thanks! I just sent you an invite to join https://github.com/orgs/IQSS/teams/dataverse-readonly . Please feel free to stop by http://chat.dataverse.org if you're a chat person. 😄 I recommend subscribing to https://groups.google.com/forum/#!forum/dataverse-dev because that's where we post notes about keeping one's dev environment in working order.

@pdurbin
Member

pdurbin commented Jan 4, 2019

@lubitchv Happy New Year! I'm just checking in to see if you're blocked or if there's anything you need.

@lubitchv
Contributor

lubitchv commented Jan 7, 2019

@pdurbin Happy New Year! I am trying to understand how versioning is done for datasets in Dataverse and how we can replicate it for variable-level metadata. It would be nice to have a meeting to discuss the best way to do it.

@pdurbin
Member

pdurbin commented Jan 14, 2019

@lubitchv I'm not sure who should be at a meeting like that but I just thought I'd mention that sometimes developers like yourself bring questions to our community calls. We have one tomorrow (though it conflicts with a pizza party 🍕 ): https://dataverse.org/community-calls

@pdurbin
Member

pdurbin commented Mar 28, 2019

Awesome progress on this issue.

Merged:

Next up: #5671

@lubitchv thanks for all your hard work on this!

@lubitchv lubitchv mentioned this issue Jun 25, 2019
3 tasks
dlmurphy added a commit to lubitchv/dataverse that referenced this issue Jun 26, 2019
Formatted some internal and external links
pdurbin added a commit that referenced this issue Jun 28, 2019
pdurbin added a commit that referenced this issue Jun 28, 2019
pdurbin added a commit that referenced this issue Jun 28, 2019