Planning for FASP Hackathon #98

Open
briandoconnor opened this issue May 4, 2020 · 11 comments
Labels
Federated Federated Analysis Systems Project Priority: Medium

Comments

@briandoconnor
Contributor

briandoconnor commented May 4, 2020

  • brainstorm use cases
  • try to stand up part or all of golden demo
  • hack on bringing together data from different sources
  • hack together systems/stacks using these APIs (WES, DRS, etc)
  • work on other topics/dev items
  • Ann has a nice use case of datasets in Asia and wanting WES on top of these datasets to send the algo to the remote system
@briandoconnor briandoconnor added Federated Federated Analysis Systems Project Needs Champion Priority: Medium labels May 4, 2020
@ianfore
Collaborator

ianfore commented Jun 1, 2020

Willing to carry forward the background work we did on the hackathon.

Ann's compute using Cloud OS seems a good scientific driver. Hackathon activity might be to do this in a FASP environment.

Also use cases of querying dbGaP data from multiple studies (different diseases, funders) according to study specific attributes. Compute on combined data.

@briandoconnor
Contributor Author

@ianfore I'm happy to help organize as well. I think there's a lot of interesting work that we can deep dive into. The main question I have right now is: when do we hold the online hackathon? Can we pick a date? That should help focus our planning.

@briandoconnor
Contributor Author

Discussing with Ian and group

  • organize ideas of what we're working on and drive our direction
  • see comments above for ideas
  • Next steps: 1) logistics support (Rishi?), 2) pick dates, 3) survey for topics

@rishidev

Doodle poll for times/dates at https://doodle.com/poll/htekearz4mmwe8ri

@ianfore
Collaborator

ianfore commented Jun 17, 2020

Copying key points in from Rishi's email so we have the background here for subsequent discussion:
Two broad themes to possible topics for hackathon

  • "what needs to be done to make the tools more usable"
  • "let's solve a scientific problem"

We will decide on the exact nature and timing of these sessions once people have indicated when they are available. One aim of these sessions is to have something usable in time for the GA4GH Plenary.
So there are two actions to take:

  1. Please go to the Doodle poll and indicate your availability by 25 June 2020.
  2. Go to Issue 98 on GitHub (i.e. here) and indicate a scientific problem you would like to tackle or a toolchain usability problem you feel would aid the project, or link to an existing issue on the project board.

@ianfore
Collaborator

ianfore commented Jun 17, 2020

I'm working up some scientific and toolchain problems to put forward, but in the meantime wanted to share some possible ideas about how we structure the hackathon. I hope these ideas let us better address the two themes above, and also the priority list discussed on 6/15/20.

There are two characteristics in hackathons I've attended that can help us

  • People tend to work in sub-groups on particular topics
  • Hackathons are endurance events

Even within a sub-group, people break out to work on something individually, then come back together when they have something to show and get feedback on.

The hackathon, rather than a three hour event, could extend over a few days. A week? No one would be on it full time for that period; groups/individuals would schedule what works for them within the framework. It would still be important to retain the sense of one collective event though - with intro and wrap-up.

@uniqueg

uniqueg commented Jun 17, 2020

Hi everyone! Great idea about the hackathons!

Something that's bothering us at ELIXIR is the lack of callbacks between client <-> WES <-> TES for updating workflow and task states. We have been hacking at this, but our only solution is very unsatisfactory and certainly not interoperable given lack of spec support. There has been an issue (ga4gh/task-execution-schemas#121) about this on the TES board opened by @susheel many moons ago, as well as a suggested corresponding change (ga4gh/workflow-execution-service-schemas#133) on the WES board (of which I am critical as stated in that issue as well as below). Nevertheless, I think it is something that concerns cross-API planning, so may be suitable to look at in the context of FASP.

Use case

Callbacks from server to client would be useful both for WES and TES, with a very similar structure, in replacing intermittent, wasteful polling of the respective log/status endpoints in order to keep track of the state of a workflow/task run.

Proposed general flow of a callback

  1. As suggested by @susheel, a callback_url or similar property could be - optionally - attached to a workflow or task run request by the client.
  2. If a request contains a callback_url, the service (WES or TES) should heed it and send status updates to that URL; the callback request (probably POST or PATCH) should contain, at the very least, the ID of the run/task in question, as well as the new state.
  3. An endpoint accepting such a request should be implemented at the callback_url (likely, but not necessarily, within the client app that sent the original workflow/task run request).
  4. The service hosting the callback_url endpoint should update (or trigger the update of) the run/task state accordingly.
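
The four steps above could be sketched roughly as follows. This is only an illustration, not part of the WES or TES specs: the `callback_url` property, the payload keys `run_id` and `state`, and the `Service`/`Client` classes are all assumptions made for the example, and the HTTP transport is replaced by an injected `send` function so the flow can be followed without a network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class RunRequest:
    workflow_url: str
    # Step 1: hypothetical optional property attached by the client.
    callback_url: Optional[str] = None


@dataclass
class Service:
    """Stand-in for a WES or TES server (not a real implementation)."""

    # Transport used to deliver callbacks; in practice an HTTP POST or PATCH.
    send: Callable[[str, dict], None]
    runs: Dict[str, RunRequest] = field(default_factory=dict)

    def submit(self, request: RunRequest) -> str:
        run_id = f"run-{len(self.runs) + 1}"
        self.runs[run_id] = request
        self.set_state(run_id, "QUEUED")
        return run_id

    def set_state(self, run_id: str, state: str) -> None:
        request = self.runs[run_id]
        # Step 2: only notify when the client asked for callbacks.
        if request.callback_url is not None:
            self.send(request.callback_url, {"run_id": run_id, "state": state})


@dataclass
class Client:
    """Stand-in for the application hosting the callback endpoint."""

    states: Dict[str, str] = field(default_factory=dict)

    def receive_callback(self, payload: dict) -> None:
        # Steps 3 and 4: accept the callback and update the tracked state.
        self.states[payload["run_id"]] = payload["state"]
```

Wiring `Service.send` to `Client.receive_callback` for a given URL is enough to watch a state change propagate without any polling; a run submitted without a `callback_url` simply produces no notifications.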

Points to discuss/consider

  • Although I don't have a specific use case in mind, there may be scenarios where the callback_url points to a different application than the one that originally sent the workflow/task run request. Likewise, there may be cases where a run/task state update should be broadcast to more than one recipient (i.e., callback_url should be a list).
  • A mechanism should probably be in place that prevents arbitrary clients from sending requests to the callback URLs. One such mechanism would be to pass a secret generated by the original client to the WES or TES, which could then in turn be attached to the callback, allowing it to be verified by the service receiving the callback; this solution of course conflicts with the different/multiple recipient approach.
  • Should a callback reception endpoint be explicitly added to the WES specs (e.g., in POST/PATCH /run/{run_id}/status), given that WES is a frequent client for TES? Personally, I think not, because of the inconsistency it creates, as we have limited control over how a client behaves (e.g., a web portal sending requests to WES vs WES sending requests to TES vs some other client sending a request to TES).
  • How/where can we specify the schema definition for the endpoint receiving the callback? As a comment in WES or TES? In an external spec, similar to /service-info?
  • Is there something we can/want to prescribe to ensure a high level of fidelity of the system? For example, if a callback is lost, a client might falsely assume that a run/task is in the wrong state (e.g., there may be a modeling workflow that is already running for days but the client still thinks it's in state QUEUED). Should it be the responsibility of the client to fall back to polling in such a case or should the service (now acting as a client) keep on retrying until the connection resumes? Or should it rather stop the run/task altogether?
  • Are there other inter-service or client-service dependencies in the Cloud WS that would benefit from a similar mechanism? If so, could they be addressed with the same mechanism? If not, what other requirements with respect to the model of the callback request would there be?
  • What other things can people think of regarding the flow, fidelity, security etc. of the proposed flow?
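
To make two of the points above more concrete (the client-generated secret and the fallback to polling when callbacks go missing), here is a minimal client-side sketch. Everything in it is an assumption for discussion: the `CallbackTracker` class, its method names, and the five-minute silence threshold are invented for the example and are not taken from any spec.

```python
import hmac
import secrets
import time
from typing import Dict, Optional


class CallbackTracker:
    """Hypothetical client-side bookkeeping for callback-driven state updates."""

    def __init__(self, max_silence_seconds: float = 300.0) -> None:
        self.max_silence_seconds = max_silence_seconds
        self.states: Dict[str, str] = {}
        self._secrets: Dict[str, str] = {}
        self._last_update: Dict[str, float] = {}

    def new_secret(self) -> str:
        # Generated by the original client and passed along with the run request.
        return secrets.token_urlsafe(32)

    def track(self, run_id: str, secret: str, now: Optional[float] = None) -> None:
        # Bind the secret to the run ID returned by the service.
        self._secrets[run_id] = secret
        self._last_update[run_id] = time.monotonic() if now is None else now

    def accept(self, run_id: str, state: str, secret: str,
               now: Optional[float] = None) -> bool:
        """Apply a callback only if it carries the secret issued for this run."""
        expected = self._secrets.get(run_id)
        if expected is None:
            return False
        # Constant-time comparison avoids leaking the secret through timing.
        if not hmac.compare_digest(expected.encode(), secret.encode()):
            return False
        self.states[run_id] = state
        self._last_update[run_id] = time.monotonic() if now is None else now
        return True

    def should_poll(self, run_id: str, now: Optional[float] = None) -> bool:
        """True when no valid callback arrived recently, so polling is safer."""
        current = time.monotonic() if now is None else now
        return current - self._last_update[run_id] > self.max_silence_seconds
```

This puts the responsibility for detecting lost callbacks on the client, which is only one of the options raised above. As noted, a single shared secret also does not extend naturally to several independent recipients.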

@ianfore
Collaborator

ianfore commented Jun 23, 2020

Discussing potential hackathon topics/activities arising from #94 concerned with API feedback.

The API Feedback document contains a number of items, for DRS at least, that might be explored in the hackathon.

  • Using the CRDC DCF DRS service, test whether a DRS service can operate in a mode where egress charges are incurred by the data user and not by the repository.
  • Write a workflow in the SB stack which uses DRS to find out where an object is and then run a compute at that location to avoid download/egress charges.
  • Same as the previous item, but with the workflow running under TES/WES.
  • Test out a schema searchable via Discovery Search prototype which maps logical level ids (use INSDC ids) to immutable DRS objects. May be able to use the NCBI implementation of SRA in BigQuery.
  • Test that the type attribute in DRS can successfully be used to indicate content type and that a workflow can act upon it to handle the content according to data type. Use data from NCBI SRA data locator which contains multiple types.
  • Test out unpacking of DRS Bundles, including objects of different types.
  • Compare the above with Research Objects to explore if RO would provide a solution for bundling that could be reused in DRS.
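
As a starting point for the "compute where the data lives" items above, a workflow step might inspect the `access_methods` of a DRS object and pick one that matches the compute environment. The sketch below assumes the DRS v1 object layout (`access_methods` entries with `type`, `region` and `access_url`); the helper name and the sample object are hypothetical and not taken from any real service.

```python
from typing import Optional


def pick_local_access_method(drs_object: dict, cloud_type: str,
                             region: Optional[str] = None) -> Optional[dict]:
    """Return an access method on the same cloud as the compute, if any.

    cloud_type is a DRS access method type such as "s3" or "gs".
    A method in the requested region is preferred; otherwise the first
    method on the same cloud is returned, which may still incur
    cross-region charges.
    """
    methods = drs_object.get("access_methods", [])
    same_cloud = [m for m in methods if m.get("type") == cloud_type]
    if region is not None:
        for method in same_cloud:
            if method.get("region") == region:
                return method
    return same_cloud[0] if same_cloud else None


# Hypothetical DRS object, shaped like a response from
# GET /ga4gh/drs/v1/objects/{object_id}; IDs and URLs are made up.
example_object = {
    "id": "example-object",
    "access_methods": [
        {"type": "s3", "region": "us-east-1",
         "access_url": {"url": "s3://example-bucket/sample.bam"}},
        {"type": "gs", "region": "us-central1",
         "access_url": {"url": "gs://example-bucket/sample.bam"}},
    ],
}

chosen = pick_local_access_method(example_object, "gs", region="us-central1")
```

A `None` result would tell the workflow that no copy sits on its own cloud, so any download would cross a provider boundary and likely incur egress charges.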

Organizational suggestion: Adding these items as GitHub Issues would allow them to be handled within the structure of the Hackathon or at other times. They could also be managed on a Hackathon Board in GitHub.

@ianfore
Collaborator

ianfore commented Jun 23, 2020

Adding some hackathon activities which are more workflow related. Will add/edit as needed.

  • Anne's example of performing analysis on data combined from Japan and other sources. Contacted Anne for details.
  • Access data via DRS APIs from the CloudOS platform and perform compute.
  • Deploy DNAStack WES service on another Google Cloud Stack, e.g. ISB-CGC. DNAStack WES wraps Google Genomics APIs, so it may not require any implementation work beyond authorization.

@ianfore
Collaborator

ianfore commented Jul 2, 2020

Followed up with Anne on the use case mentioned above.
To summarize:

  1. in terms of current activities this use case ties in best to the various Beacon activities (Beacon/Beacon Network/Beacon 2.0).
  2. Within those parameters it would be a good hackathon activity.
  3. The 90 mins Anne and I spent working this through illustrated what a hackathon activity would be.
  4. A hackathon activity on Cloud OS/NextFlow would also be useful - but needs another use case

Some detail:
1. Tie in to the various Beacon activities (Beacon/Beacon Network/Beacon 2.0)
The use case derives from this paper which describes a rare mutation leading to pediatric intracranial germ cell tumors. The prevalence of these tumors is "5–8 fold greater in Japan and other East Asian countries than in Western countries". This drives the need for global interoperability.

There is a known mutation at a specific location. So a Go Fish exercise is relevant to identify who else has cases with this rare mutation. That led us to Beacon/Beacon Network/Beacon 2.0. We began to explore how Beacon Network might be used in this way but were unsuccessful. 2 and 3 below show how this might be explored in a hackathon.

More broadly, at this point we are not aware of specific databases that are available online to provide data from such cases and which would allow compute to be performed on them, i.e. there doesn't seem to be an immediate potential from this scientific use case for a hackathon activity based around DRS, WES, TES, Passport etc. See 4 for where that might lead.

2. Within those parameters this intracranial tumor problem would be a good hackathon activity.
The activity would be to explore what Beacon Network could do to answer the Go Fish question described above. It would potentially involve those who understand the scientific question and members of the Beacon team, who could a) assist with how to use the tool to accomplish the goal and b) learn where their tool might be improved to help a user fulfill their use case.

3. What a hackathon activity could be
During our conversation we went to the Beacon Network Interface and explored how we could do the query. Unfortunately we couldn't formulate the question in the terms that Beacon requires. We didn't know a) the chromosome, b) the location, or c) the variant. We hunted around other resources to find that information but were unsuccessful. Anne is following up to get that info.
This was illustrative of a hackathon activity in that it involved trying to solve a specific problem through hands-on collaborative use of an existing tool (code) and other online reference resources. In the course of what we did we came up with a requirement that one might submit to the Beacon Network GitHub.
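
For anyone picking this up, the three missing pieces map onto the parameters of a Beacon-style allele query. The sketch below assumes Beacon v1 style parameter names (`referenceName`, `start`, `referenceBases`, `alternateBases`, `assemblyId`); the base URL and all values are placeholders, since the actual chromosome, position and variant are still unknown.

```python
from typing import Optional
from urllib.parse import urlencode


def build_beacon_query(base_url: str,
                       chromosome: Optional[str],
                       position: Optional[int],
                       reference_bases: Optional[str],
                       alternate_bases: Optional[str],
                       assembly_id: str = "GRCh38") -> str:
    """Build a query URL, refusing to guess any missing piece."""
    required = {
        "chromosome": chromosome,
        "position": position,
        "reference_bases": reference_bases,
        "alternate_bases": alternate_bases,
    }
    missing = [name for name, value in required.items() if value is None]
    if missing:
        # This is the situation described above: the question cannot be
        # posed until the variant is fully specified.
        raise ValueError("cannot query Beacon without: " + ", ".join(missing))
    params = {
        "referenceName": chromosome,
        "start": position,
        "referenceBases": reference_bases,
        "alternateBases": alternate_bases,
        "assemblyId": assembly_id,
    }
    return f"{base_url}/query?{urlencode(params)}"


# Placeholder values only; they do not describe the mutation in the paper.
url = build_beacon_query("https://beacon.example.org", "1", 12345, "A", "T")
```

Failing loudly on missing inputs mirrors the requirement we ran into: a user who knows the gene or disease but not the coordinates has no way in.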

4. A hackathon activity on Cloud OS/NextFlow
From a technology point of view, how GA4GH APIs might be used from the LifeBit CloudOS platform, using NextFlow, still seemed like a worthwhile activity. The driving scientific use case would have to emerge from elsewhere than the intracranial germ cell tumor problem.
We noted the Genomics England usage of CloudOS, alongside the fact that Genomics England is a GA4GH Driver Project.

@ianfore
Collaborator

ianfore commented Jul 6, 2020

Added link to Google Sheet categorizing possible activities listed here.

4 participants