
Document search is not working anymore #500

Closed
Tracked by #524 ...
aboydnw opened this issue May 9, 2022 · 46 comments
Assignees
Labels
development A task for the DS development team on APT

Comments

@aboydnw

aboydnw commented May 9, 2022

Here is the error in the console log when trying to conduct a search:
[screenshot of console error]

@leothomas
Collaborator

Do you know if this is occurring in both staging and production? If not, which environment is it occurring in?

@aboydnw
Author

aboydnw commented May 24, 2022

I pulled this screenshot from MCP. I'm not sure if it's happening in staging as well, since the entire documents page isn't accessible there.

@leothomas
Collaborator

It seems that the search functionality is broken in both the staging and prod stacks, but for different reasons.

Prod:

The error message reads:

{"detail":"{\"message\":\"Credential should be scoped to a valid region, not 'us-east-1'. \"}"}

I believe this is just telling us that we're not signing the request with the correct region, similar to this ticket.

We are hardcoding the region as us-east-1 when signing the request, whereas in MCP the Elasticsearch Domain is deployed to us-west-2 (at Shawn's request - they get a discount in that region):

 aws cloudformation describe-stack-resources --stack-name nasa-apt-api-lambda-prod --query 'StackResources[?ResourceType==`AWS::Elasticsearch::Domain`]'
[
    {
        "StackName": "nasa-apt-api-lambda-prod",
        "StackId": "arn:aws:cloudformation:us-west-2:237694371684:stack/nasa-apt-api-lambda-prod/c2aeefb0-7fb7-11ec-9b7d-02f9de4065db",
        "LogicalResourceId": "nasaaptapilambdaprodelasticsearchdomainF69BA438",
        "PhysicalResourceId": "apt-api-lambda-prod-elastic",
        "ResourceType": "AWS::Elasticsearch::Domain",
        "Timestamp": "2022-01-27T22:39:35.796000+00:00",
        "ResourceStatus": "CREATE_COMPLETE",
        "DriftInformation": {
            "StackResourceDriftStatus": "NOT_CHECKED"
        }
    }
]

I strongly suspect that changing the hardcoded region from:

region = "us-east-1"

to

region = os.environ["AWS_REGION"]

would resolve the error. However, I have not deployed the fix to production because, as of this moment, we do not have the ability to re-deploy the stack (if needed) due to the MCP block on ApiGatewayV2::HTTPApi resource types. I also haven't deployed the fix to staging, because 1) staging is already deployed to us-east-1 and 2) staging is failing for a different reason (more on that below).
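For context, here is a minimal sketch of what the signing change could look like, assuming the API builds its signer with requests_aws4auth; the function name and structure are illustrative, not the exact APT code:

import os

import boto3
from requests_aws4auth import AWS4Auth


def aws_auth() -> AWS4Auth:
    """Build a SigV4 signer scoped to the region the Lambda is running in."""
    # AWS_REGION is set automatically in the Lambda execution environment, so the
    # signature will match whichever region the Elasticsearch domain is deployed to.
    region = os.environ["AWS_REGION"]
    credentials = boto3.Session().get_credentials()
    return AWS4Auth(
        credentials.access_key,
        credentials.secret_key,
        region,
        "es",  # service name used when signing Elasticsearch/OpenSearch requests
        session_token=credentials.token,
    )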

Staging:

Staging is failing with a very puzzling error:

<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n</body>\r\n</html>\r\n

with no traces in the lambda logs and only logging a 502 error in the Api logs.

The 502 error seems to indicate an integration error. I will report back once I have more information

@leothomas
Collaborator

Update: the 502 error isn't being generated by the Lambda function or the API Gateway instance, it's generated by the Elasticsearch node itself.

A few strange things to note about the Elasticsearch instance:

Last night:

  • I couldn't load the instance info in the AWS console
  • The Elasticsearch instance had no status/allocated memory in the Opensearch Domain overview tab:
    [screenshot]
  • Querying the Elasticsearch instance through the command line showed that the instance was "up and running as normal"
  • I added logging to the Elasticsearch instance through the command line

This morning:

  • The Elasticsearch instance now loads in the AWS console
  • The Elasticsearch instance seems to be stuck in some sort of update processing state:

[screenshot]

(I'm not sure how this was triggered, perhaps by adding the logs last night?)

@leothomas
Collaborator

leothomas commented Jun 1, 2022

Potential resolution: Migrate to Opensearch

Elasticsearch was open source until recently. After the company that makes Elasticsearch announced a more restrictive license for future versions, AWS forked Elasticsearch and hosts it as a service called Opensearch, which they will keep open source going forward. ref

The APT stack currently uses the Elasticsearch 7.7 engine (the last open-source version before the new license), so the only way to receive updates going forward is to migrate to an Opensearch service. Since Opensearch is a fork of Elasticsearch, AWS says that Opensearch will be backwards compatible with Elasticsearch without requiring updates to client code.

Migrating to Opensearch would not explain why the Elasticsearch instance entered and remains stuck in an unreachable state, but it would likely allow us to sidestep the unreachable instance and ensure that we can continue to stay up to date with Opensearch developments.

(Note: staying up to date with Opensearch is a small advantage; our usage of the search service is so basic that I doubt any of the updates would be crucial to APT. However, it's still a "nice to have".)

Consideration:

It's possible to upgrade the instance from Elasticsearch (v7.7) to Opensearch (v1.2) from the AWS console, but I strongly suspect that behind the scenes, the update will just delete the Elasticsearch domain and create an Opensearch one with the same name, losing all stored/indexed data.

I don't have a problem with losing the indexed documents in staging. As for the data that we will be importing from the UAH prod stack to the MCP prod stack, none of the crucial ATBDs are published, which means they aren't yet indexed in the UAH prod Elasticsearch instance. We would be free to start with a fresh search instance, in this case OpenSearch rather than Elasticsearch, in the new MCP prod stack.

Options to migrate the data (if we do want to migrate it) include:

  • The AWS-recommended procedure: take a snapshot of the instance, upload the snapshot to an S3 bucket, grant the new Opensearch instance permission to access the S3 bucket, and load the new Opensearch instance from the snapshot (see the sketch after this list)
  • Build re-indexing logic into the APT API that re-indexes all of the published ATBDs. This used to exist in the APT API, and I think it's a good idea as a fail-safe in case the background task in charge of indexing the ATBD fails.
    • Note: ATBD indexing happens whenever a document is published or whenever the minor version is bumped, however, each minor version bump gets indexed separately (eg: v1.1, v1.2, v1.3, v2.0, v2.1). This won't be possible to maintain if we implement a re-index functionality, since the indexing process pulls directly from the database, and the database does not store every minor version. However I would argue that we can do away with that requirement (if it ever even was a requirement to begin with) since the purpose of the search functionality is to find documents by keywords. I don't think it makes much sense to index terms from a document that is no longer editable/viewable in the APT web-app (it could still be downloaded as a PDF, however)
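A rough sketch of the AWS-recommended snapshot path, assuming an S3 bucket and an IAM role the domain can assume; the bucket, role, repository, and endpoint names below are placeholders:

import boto3
import requests
from requests_aws4auth import AWS4Auth

region = "us-east-1"  # region of the source domain (placeholder)
old_host = "https://old-elasticsearch-domain-endpoint"  # placeholder endpoints
new_host = "https://new-opensearch-domain-endpoint"

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, "es",
                   session_token=credentials.token)

# 1. Register an S3 snapshot repository on the old domain.
repo_body = {
    "type": "s3",
    "settings": {
        "bucket": "apt-es-snapshots",  # hypothetical bucket
        "region": region,
        "role_arn": "arn:aws:iam::123456789012:role/es-snapshot-role",  # hypothetical role
    },
}
requests.put(f"{old_host}/_snapshot/apt-backup", auth=awsauth, json=repo_body)

# 2. Take a snapshot of all indices.
requests.put(f"{old_host}/_snapshot/apt-backup/snapshot-1", auth=awsauth)

# 3. Register the same repository on the new OpenSearch domain, then restore.
requests.put(f"{new_host}/_snapshot/apt-backup", auth=awsauth, json=repo_body)
requests.post(f"{new_host}/_snapshot/apt-backup/snapshot-1/_restore", auth=awsauth)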

@leothomas
Collaborator

Quick update on the search functionality in the staging instance: The AWS console still shows the Elasticsearch instance in an "in-progress" state, however the instance is now responsive. All of the documents that were indexed before the crash are lost, but I was able to index a new one and search it:
eg: querying keyword "ocean"

curl -X 'POST' \
  'https://0mrzuyq2e3.execute-api.us-east-1.amazonaws.com/v2/search' \
  -H 'accept: application/json' \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"query":{"bool":{"must":[{"multi_match":{"query":"oceans"}}],"filter":[]}},"highlight":{"fields":{"*":{}}}}'

result:

{
  "took": 598,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.54655457,
    "hits": [
      {
        "_index": "atbd",
        "_type": "atbd",
        "_id": "19_v1",
        "_score": 0.54655457,
        "_source": {
           ...
        }
      ....
      ]
    }

@aboydnw
Author

aboydnw commented Jun 13, 2022

Update: we would like to use option 2.1 outlined in #509

@leothomas
Collaborator

Update: after deploying an update to MCP which uses the region from the Lambda's execution context to sign the Elasticsearch request, the document search still appears to be broken, with an error message indicating that the signing credentials are scoped to us-east-1 (even though the ES instance is deployed to us-west-2).

@naomatheus
Collaborator

Proposed Solution 1

@naomatheus naomatheus self-assigned this Jul 21, 2022
@aboydnw aboydnw added the development A task for the DS development team on APT label Jul 21, 2022
@naomatheus
Collaborator

naomatheus commented Aug 4, 2022

I've followed the steps to create a manual snapshot of the current Elasticsearch index. However, the instance appears locked in "processing," and the Opensearch Dashboard is inaccessible. I've updated the security configurations on the Opensearch domain in DevSeed AWS, but the changes are not applying since the instance appears locked.

I was able to work through the IAM roles and permissions required to create a snapshot of the indices, but was not able to create the snapshot because the instance is locked.

There is an option to use the AWS dashboard to force through an update of this domain, and I believe the current indices may be lost during that process.

I am recommending the following:

  1. Process the update from ElasticSearch to OpenSearch through the AWS console (risking loss of the existing indexes).
  2. Deploy a new OpenSearch domain.
  3. Configure the new OpenSearch domain to automate index snapshots and back them up to S3.

I'll come back to this ticket after some discussion and once I've prepared the infrastructure code to deploy a new OpenSearch stack.

Here's a TL;DR of what changes and what remains the same when migrating from ElasticSearch to OpenSearch.

Upgrade to OpenSearch

The upgrade is automated and can be triggered through the AWS Management Console. However, any changes made through the console would not be reflected in the CDK infrastructure code, so in this branch the infrastructure code has been updated. Notes are added on the affected areas below, summarizing what will need to be changed and what can be left alone.

New API version

New event format

What's staying the same?

  • The following features and functionality, among others not listed, will remain the same:
    • Service principal (es.amazonaws.com)
    • Vendor code
    • Domain ARNs
    • Domain endpoints
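For illustration, a minimal sketch of how the CDK domain definition could move from the aws_elasticsearch module to aws_opensearchservice; the stack class, construct IDs, and sizing here are placeholders, not the actual APT infrastructure code:

from aws_cdk import Stack
from aws_cdk import aws_opensearchservice as opensearch
from constructs import Construct


class SearchStack(Stack):  # hypothetical stack for illustration
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Previously something like:
        #   from aws_cdk import aws_elasticsearch as es
        #   es.Domain(self, "elasticsearch-domain", version=es.ElasticsearchVersion.V7_7, ...)
        self.domain = opensearch.Domain(
            self,
            "opensearch-domain",  # placeholder construct ID
            version=opensearch.EngineVersion.OPENSEARCH_1_2,
            capacity=opensearch.CapacityConfig(
                data_nodes=1,
                data_node_instance_type="t3.small.search",
            ),
        )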

@naomatheus
Collaborator

Consideration
We don't have a way to handle permissions in OpenSearch/Elasticsearch: in APT, we do not have a way to limit index search capability in line with user document access permissions. Doing so would require adding permissions at the OpenSearch/Elasticsearch layer.

Current solution:
Documents only get indexed when they are published, i.e. they're available to everyone.

@naomatheus
Collaborator

Unblocked: there are no critical ATBDs/documents in APT in the prod accounts, so the migration steps will not need to include creating a snapshot of the current indices.
Indices in development will also be lost when upgrading from ElasticSearch to OpenSearch.

@naomatheus
Collaborator

Opensearch upgrade:
develop...open-search-upgrade-1

Ready to deploy pending review.

@naomatheus
Collaborator

naomatheus commented Aug 12, 2022

Remaining steps:

  1. Change the deployed search endpoint from https://0mrzuyq2e3.execute-api.us-east-1.amazonaws.com/v2/search to https://2t2dh1w620.execute-api.us-east-1.amazonaws.com/search.
  2. The new OpenSearch indices will be blank, so test ATBDs should be created in the stage environment.
  3. Search for the contents of the newly indexed ATBDs.

@naomatheus
Collaborator

See this branch and PR thread for changes made.
Error state: the ATBD documents page displayed no documents.

Current state:
The ATBD documents page now displays documents.

Next steps:
Change the deployed search endpoint from https://0mrzuyq2e3.execute-api.us-east-1.amazonaws.com/v2/search to https://2t2dh1w620.execute-api.us-east-1.amazonaws.com/search.

@naomatheus
Collaborator

Upcoming changes to Backend API for Opensearch compatibility

  • Replace requests_aws4auth with opensearch-py/opensearchpy
    • requests_aws4auth does not currently support AWS Opensearch service and is maintained outside of the Opensearch organization
    • opensearch-py is supported and maintained by Opensearch.org, is opensource licensed, and is compatible with Opensearch v1.0.1, our current version of Opensearch
    • Affected modules are elasticsearch.py at app/api/v2/elasticsearch.py and at app/search/elasticsearch.py

PR forthcoming; in the meantime, see app/api/v2/elasticsearch.aws_auth. The aws_auth function returns an object used to authorize POST requests to the Elasticsearch domain. In the update, the client must be used directly with queries (reason: opensearch-py is a low-level client for now).
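Here's a minimal sketch of the client-based approach with opensearch-py; the host, region, and index names are illustrative, not the actual APT configuration:

import boto3
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection

host = "search-apt-staging-example.us-east-1.es.amazonaws.com"  # placeholder domain endpoint
region = "us-east-1"

credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region)

client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Instead of signing a raw POST with requests_aws4auth, queries go through the client.
results = client.search(
    index="atbd",
    body={"query": {"multi_match": {"query": "oceans"}}},
)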

  • Setup.py is deprecated (?)
    • This should be replaced with a shell script that uses pip for package management
    • setup.py install is deprecated

@bwbaker1
Collaborator

@naomatheus @wrynearson @aboydnw @deborahUAH After testing OpenSearch on staging, it does work, but there are still a few issues.

(1) A search term must currently be entered. We would like a user to be able to search for any ATBDs published in a specific year; for example, leave the search term blank and just search for all ATBDs published in 2022.

[screenshot: apt_1]

(2) When a search term is included, the year must still be set to "All." The image below shows no document matching "contributor" when searching with the year set to 2022, even though this particular ATBD was published in 2022.

[screenshot: apt_2]

Below are the search results with the search term "contributor" and the year set to "All."

[screenshot: apt_3]
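For reference, a sketch of the two query shapes the UI would need to send to support these cases, assuming the index stores a publication-date field (the published_at field name is an assumption, not necessarily the current ATBD mapping):

# Year-only search: no search term, so fall back to match_all and filter by year.
year_only_query = {
    "query": {
        "bool": {
            "must": [{"match_all": {}}],
            "filter": [
                {"range": {"published_at": {"gte": "2022-01-01", "lt": "2023-01-01"}}}
            ],
        }
    }
}

# Term plus year: combine the existing multi_match with the same year filter.
term_and_year_query = {
    "query": {
        "bool": {
            "must": [{"multi_match": {"query": "contributor"}}],
            "filter": [
                {"range": {"published_at": {"gte": "2022-01-01", "lt": "2023-01-01"}}}
            ],
        }
    }
}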

@wrynearson
Member

Thanks @bwbaker1. I think we want to prioritize having a functional production environment quickly. To do so, would you and @deborahUAH agree to push our current implementation of this ticket from staging to production?

@bwbaker1
Collaborator

@wrynearson My opinion is that it is functional enough to go ahead and push to PROD. Then we can get the DEMO ATBD public, since that is a big deal. The other parts of this ticket can be fixed afterward. But this is @deborahUAH's decision.

@deborahUAH
Collaborator

deborahUAH commented Nov 2, 2022

Brad is correct. We need to download the real DEMO ATBD that @bwbaker1 has created in prod and check that it looks OK before we publish it there (because at that point it will be visible for good). MCP Prod is not a testing space. This is why we needed a staging space on MCP, and I thought one was created months ago?? (there was a ticket for it somewhere - please add # if you find it).

The order in MCP Prod must be: 1) download the DEMO ATBD PDF and check that it looks good (enough), 2) publish it, 3) check the ability to see it when searching.
@wrynearson @naomatheus

@bwbaker1
Collaborator

bwbaker1 commented Nov 2, 2022

@wrynearson @deborahUAH I just checked and I no longer see the test ATBD on PROD. So it seems you were successful in deleting that document.

@deborahUAH
Collaborator

see my clarification on deletions on ticket #548

@naomatheus
Collaborator

Hey @wrynearson . Are we able to move this issue to done for now?
I'd suggest splitting off another issue if there's more to do here for this specific issue.

@wrynearson
Member

Hey @naomatheus , sorry I just saw this.

@deborahUAH is the owner of this repo, so I'm not able to close tickets. She treats "done" as deployed to prod and tested. Once their team writes a demo ATBD, downloads it while it's in draft, then publishes it and makes sure it's indexed, they will close this.

@naomatheus
Collaborator

Gotcha. Noted @wrynearson

@deborahUAH
Collaborator

Awaiting more than one document so this can be tested on Production.

@wrynearson
Member

@deborahUAH @bwbaker1 now that we have Document PDFs working in production, this ticket is unblocked. I've marked it for your review.

@wrynearson wrynearson assigned bwbaker1 and deborahUAH and unassigned naomatheus Apr 4, 2023
@bwbaker1
Collaborator

bwbaker1 commented Apr 4, 2023

Note: we can't review until we have published documents.

@bwbaker1
Collaborator

@wrynearson Search is no longer working on staging. It does not return an error or anything.

Screen.Recording.2023-04-10.at.11.32.38.AM.mov

@wrynearson
Member

wrynearson commented Apr 11, 2023

Thanks for flagging this @bwbaker1

@naomatheus and maybe @thenav56 - could you look into this? I'm getting 500 errors in the console, so this might be an AWS issue.

[screenshot]

@wrynearson
Member

@sunu @thenav56, do we need to make any further updates on production to make sure that this ticket can be closed (e.g. #706)?

@wrynearson
Member

@thenav56 @sunu @batpad currently, no results are appearing on staging for document search. There is no error, but results that we think should appear aren't showing up.

cc @bwbaker1

@bwbaker1
Collaborator

bwbaker1 commented May 1, 2023

⬆️ Here is a quick screen recording of searching.

Screen.Recording.2023-05-01.at.11.48.37.AM.mov

@wrynearson
Member

wrynearson commented May 2, 2023

@bwbaker1 the issue is that documents made public before we implemented #699 are not "indexed" (i.e., not searchable).

We made two documents on staging that are searchable now:

Search.Index.mov

This won't be an issue on production because we don't have any published documents there yet, so once the first document is published, it will be indexed. However, we will go back and re-index the existing published documents on staging to avoid any further confusion.
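A rough sketch of what re-indexing the already-published staging documents could look like, reusing an opensearch-py client like the one sketched earlier; fetch_published_atbds is a hypothetical helper, and the ID scheme just mirrors the "19_v1" style seen in the search results above:

def reindex_published_atbds(client, fetch_published_atbds):
    """Re-index documents that were published before indexing-on-publish was in place."""
    for atbd in fetch_published_atbds():  # hypothetical helper returning published ATBDs as dicts
        client.index(
            index="atbd",
            id=f"{atbd['id']}_v{atbd['major_version']}",  # assumed ID scheme, e.g. "19_v1"
            body=atbd,
        )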

cc @thenav56 @sunu

@bwbaker1
Collaborator

bwbaker1 commented May 2, 2023

@wrynearson This makes sense. I remember this happening a while back when search was updated/fixed.
@thenav56 @sunu @batpad Thanks for getting this solved so quickly!

@wrynearson
Member

@thenav56 should we wait to push this to production until #718 is ready? cc @sunu @batpad

@sunu
Collaborator

sunu commented May 4, 2023

@wrynearson IMO there is no need to wait since the production instance doesn't have any published ATBDs anyway

@thenav56
Collaborator

thenav56 commented May 4, 2023

@wrynearson Yes, as @sunu said we are good to push this to production.
#718 can wait as it will add a feature to fix ES issues if there are any in the future.
