Upgrade Solr #4158

Closed
djbrooke opened this issue Sep 26, 2017 · 38 comments

@djbrooke
Contributor

djbrooke commented Sep 26, 2017

We should upgrade Solr to at least a non-EOL version (6.4 or later).

It's important to be on a version where we can get support/patches if needed.

@pdurbin pdurbin changed the title Upgrade SOLR Upgrade Solr Sep 26, 2017
@pdurbin
Member

pdurbin commented Oct 6, 2017

As we work on this issue we should also consider the list of security concerns in https://help.hmdc.harvard.edu/Ticket/Display.html?id=253987

  • Is it possible to secure the Solr admin console at http://localhost:8983/solr ? I'm not sure what this means exactly. Perhaps requiring a password?
  • How difficult would it be to encrypt traffic between Solr and Dataverse such as with HTTPS?
  • Should enableRemoteStreaming be enabled or not?

@kcondon
Contributor

kcondon commented Oct 18, 2017

https://lucene.apache.org/solr/guide/6_6/upgrading-solr.html

Users upgrading from older versions are strongly encouraged to consult CHANGES.txt for the details of all changes since the version they are upgrading from.

Index Format Changes
Solr 6 has no support for reading Lucene/Solr 4.x and earlier indexes. Be sure to run the Lucene IndexUpgrader included with Solr 5.5 if you might still have old 4x formatted segments in your index. Alternatively: fully optimize your index with Solr 5.5 to make sure it consists only of one up-to-date index segment.

Note: the index format note above comes only from the latest version's changes. I did not go through all of CHANGES.txt as the guide recommends, but this note seemed important and clear enough.
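For anyone who wants to try the IndexUpgrader route mentioned in the guide, here is a minimal sketch that only prints the command instead of running it. The jar directory, the 5.5.5 version number, and the index path are assumptions for illustration; check them against the Solr 5.5 download you actually have, and back up the index before running anything.

```shell
#!/bin/sh
# Print (do not run) the Lucene IndexUpgrader command for one index directory.
# $1: directory holding the Lucene 5.5 jars (assumed location)
# $2: Lucene version of those jars, e.g. 5.5.5 (assumption)
# $3: path to the core's index directory (assumption)
index_upgrader_cmd() {
  lib_dir=$1
  version=$2
  index_dir=$3
  printf 'java -cp %s/lucene-core-%s.jar:%s/lucene-backward-codecs-%s.jar org.apache.lucene.index.IndexUpgrader -delete-prior-commits %s\n' \
    "$lib_dir" "$version" "$lib_dir" "$version" "$index_dir"
}
```

Calling `index_upgrader_cmd /opt/lucene-5.5 5.5.5 /var/solr/data/collection1/data/index` prints the full java command so it can be reviewed before it is pasted into a terminal. Solr should be stopped while the upgrader runs.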

@matthew-a-dunlap matthew-a-dunlap self-assigned this Oct 23, 2017
matthew-a-dunlap added a commit that referenced this issue Oct 23, 2017
It looks like 7.1 is not much more work, but this is the lowest bar to start with.
@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Oct 23, 2017

I have updated dataverse to make it compile with 6.4.2 / 7.1.0 (both work if switched in the pom). Next step is to get the new server up and running.

With Solr 6+, Solr uses a managed schema by default instead of a user-editable XML file. It can be switched back to the XML file with minimal changes: https://stackoverflow.com/questions/37324603/ .
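As a rough way to see which mode a given core is in, the sketch below greps solrconfig.xml for `ClassicIndexSchemaFactory`, the schema factory class that selects the hand-edited schema.xml. The function name is made up for illustration, and the check is crude: it would also match a commented-out declaration.

```shell
#!/bin/sh
# Report which schema factory a solrconfig.xml appears to select.
# $1: path to solrconfig.xml
schema_mode() {
  if grep -q 'ClassicIndexSchemaFactory' "$1"; then
    echo classic   # schema.xml, edited by hand
  else
    echo managed   # managed-schema, the default when nothing is declared
  fi
}
```

For example, `schema_mode server/solr/collection1/conf/solrconfig.xml` prints `classic` or `managed`.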

matthew-a-dunlap added a commit that referenced this issue Oct 25, 2017
I am unsure if this is actually the correct approach but am committing it as this story is getting backlogged.
@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Oct 25, 2017

The code as is still has errors when actually trying to write to solr, but I have worked through a number of them.

Note: The branch above is using 7.1.0, I see that as the best path forward as 6/7 seem similar in terms of dev work.

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Mar 14, 2018

When we upgrade solr, will we be upgrading our existing indexes (as noted by Kevin in #4158 (comment))? Or will we just reindex from scratch?

If we are looking to upgrade our indexes, what is expected related to that in this story?

@scolapasta

@kcondon
Contributor

kcondon commented Mar 23, 2018

Issues found:
X -Under "Installing Solr" in the Installation Guide prerequisites, change \ to /:
http://guides.dataverse.org/en/4158-update-solr/installation/prerequisites.html

cp -r configsets\_default .
cp: cannot stat `configsets_default': No such file or directory

X -Instructions for starting Solr fail when run as root; -force is needed:

bin/solr start
WARNING: Starting Solr as the root user is a security risk and not considered best practice. Exiting.
         Please consult the Reference Guide. To override this check, start with argument '-force'

X -Same with creating core:

bin/solr create_core -c collection1 -d server/solr/collection1/conf/
WARNING: Creating cores as the root user can cause Solr to fail and is not advisable. Exiting.
         If you started Solr as root (not advisable either), force core creation by adding argument -force

X -Same with init script:

service solr start
/etc/init.d/solr: line 1: 1: command not found
Starting Solr
WARNING: Starting Solr as the root user is a security risk and not considered best practice. Exiting.
         Please consult the Reference Guide. To override this check, start with argument '-force'

So maybe we need to call out the recommendation not to run Solr as root, and explain how to run it as another user?

-Do we need to mention anything about upgrading Solr? Such as new init script (command line args), not run as root, shut down service (obvious I know)
X -More clarity on getting config files from zip. /tmp?
X -searching on "identifier" of a persistent id doesn't work from basic search but does work from advanced search.
X -Searching on just the dataset identifier does not work from basic search in solr 7 branch.
-Searching on other id/agency works but does not show the field name or string match in the matching area. It does work on /develop. This may be a compound field issue, since author name is displayed but author affiliation is not.
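On the root-user warnings above, one option is to start Solr under a dedicated unprivileged account instead of passing -force. A minimal sketch follows; it only prints the command it would use. The account name `solr` and the install directory are assumptions, and the account must already exist and own the Solr directory.

```shell
#!/bin/sh
# Print the command that would start Solr, given the caller's numeric user id.
# $1: numeric uid of the caller (pass "$(id -u)")
# $2: Solr install directory (assumption, e.g. /usr/local/solr)
# $3: unprivileged account name (assumption, e.g. solr)
solr_start_cmd() {
  uid=$1
  solr_dir=$2
  solr_user=$3
  if [ "$uid" -eq 0 ]; then
    # Root: hand off to the unprivileged account rather than adding -force.
    printf 'su - %s -c "%s/bin/solr start"\n' "$solr_user" "$solr_dir"
  else
    # Already unprivileged: start Solr directly.
    printf '%s/bin/solr start\n' "$solr_dir"
  fi
}
```

Running `solr_start_cmd "$(id -u)" /usr/local/solr solr` shows which form applies to the current user. The same idea could be applied to the create_core step and the init script.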

pdurbin added a commit that referenced this issue Mar 26, 2018
@pdurbin
Member

pdurbin commented Mar 26, 2018

@kcondon after we talked I made many tweaks to the docs in 41a7ea9

Please take a look. Thanks.

pdurbin added a commit that referenced this issue Mar 26, 2018
@pdurbin
Member

pdurbin commented Mar 27, 2018

As we work on this issue we should also consider the list of security concerns in https://help.hmdc.harvard.edu/Ticket/Display.html?id=253987

  • Is it possible to secure the Solr admin console at http://localhost:8983/solr ? I'm not sure what this means exactly. Perhaps requiring a password?
  • How difficult would it be to encrypt traffic between Solr and Dataverse such as with HTTPS?
  • Should enableRemoteStreaming be enabled or not?

I took a quick look at these questions. I think they're all out of scope in the sense that we probably won't address any of them when using Solr in production at Harvard Dataverse. Firewalling off Solr is enough for us. That said, it looks like the Solr project has documentation on each of the topics above:

People who are interested in these topics should read the documentation above.

pdurbin added a commit that referenced this issue Mar 27, 2018
"Match" was showing up because the bundle key didn't exist. We are
switching away from the deprecated "JsfHelper.localize" method.
pdurbin added a commit that referenced this issue Mar 27, 2018
Back in 60e640b when I was playing with spelling suggestions from Solr I
changed the request handler from "/select" (the default) to "/spell". We
didn't have time to fully explore the spelling suggestions feature of
Solr during the 4.0 rewrite and the "/spell" request handler seems to be
leading to other bugs, such as not being able to search on the
"identifier" portion of a DOI (i.e. "JNIUOA") from basic search. In
short we are switching to the default request handler for Solr,
something I would have done before tagging 4.0 if I had realized I had
left the "/spell" request handler in there.
@pdurbin
Member

pdurbin commented Mar 27, 2018

At standup I mentioned that a basic search of the identifier was working from Solr directly but not from Dataverse. The difference is that my curl command against Solr was using the default /select request handler but Dataverse has been using the /spell request handler. In d3b721e I switched Dataverse back to the default /select request handler and included a long commit message explaining that I was playing with the "spelling suggestions" ("did you mean?") feature of Solr early in the Dataverse 4.0 rewrite and inadvertently left /spell in the code. Now basic search works fine on an identifier (i.e. JNIUOA).
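To compare the two request handlers side by side, a small helper like the following can build the URLs. This is only a sketch: the function name is made up, the host and core name are taken from the curl commands used elsewhere in this issue, and it assumes both handlers are defined in solrconfig.xml and that the query term needs no URL encoding.

```shell
#!/bin/sh
# Build a query URL for a given Solr request handler.
# $1: request handler name without the leading slash (select or spell)
# $2: query term; assumed to be plain letters and digits
solr_query_url() {
  printf 'http://localhost:8983/solr/collection1/%s?wt=json&indent=true&q=%s\n' "$1" "$2"
}

# Example, with Solr running locally:
#   curl "$(solr_query_url select JNIUOA)"
#   curl "$(solr_query_url spell JNIUOA)"
```

Comparing the numFound values in the two responses should show whether the handler alone accounts for the difference.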

@pdurbin
Member

pdurbin commented Mar 27, 2018

I just gave a brain dump to @kcondon about recent fixes, including how in a7fd43a I made it so that "Dataset Persistent ID" shows up instead of "Match":

[screenshot taken 2018-03-27 at 10:43:05 am]

@pdurbin pdurbin removed their assignment Mar 27, 2018
pdurbin added a commit that referenced this issue Mar 27, 2018
pdurbin added a commit that referenced this issue Mar 27, 2018
@pdurbin
Member

pdurbin commented Mar 28, 2018

I checked in with @kcondon this morning and he mentioned that highlighting of search fields is not working as consistently as on the develop branch. I tested this as of a088d5e on the Solr branch and he's right. The good news is that the object can still be found with a search so it sounds like we might merge the pull request without a fix (we'd open a new issue and fix it later) but I'm assigning myself to this issue to at least poke around a bit and characterize the bug better. Perhaps I'll write some automated tests that exercise the bug.

@pdurbin pdurbin self-assigned this Mar 28, 2018
@pdurbin
Member

pdurbin commented Mar 28, 2018

I'm on the same commit (a088d5e) and this highlighting bug is really strange. It seems to be based on the data entered.

For example, when filling in "otherIdAgency" if I use the value "agency1", the highlighting works. Notice "otherIdAgency" under "highlighting" at the bottom of this JSON output:

curl 'http://localhost:8983/solr/collection1/select?rows=1000000&wt=json&indent=true&hl=true&hl.fl=*&q=agency1'

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"agency1",
      "hl":"true",
      "indent":"true",
      "hl.fl":"*",
      "rows":"1000000",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"dataset_124_draft",
        "entityId":124,
        "dataverseVersionIndexedBy_s":"4.8.5",
        "identifier":"doi:10.5072/FK2/M8KVFJ",
        "dsPersistentId":"doi:10.5072/FK2/M8KVFJ",
        "persistentUrl":"https://doi.org/10.5072/FK2/M8KVFJ",
        "dvObjectType":"datasets",
        "publicationStatus":["Unpublished",
          "Draft"],
        "dateSort":"2018-03-28T20:29:59.697Z",
        "dateFriendly":"Mar 28, 2018",
        "isHarvested":false,
        "metadataSource":"Root",
        "datasetVersionId":51,
        "citation":"Admin, Dataverse, 2018, \"429\", https://doi.org/10.5072/FK2/M8KVFJ, Root, DRAFT VERSION",
        "citationHtml":"Admin, Dataverse, 2018, \"429\", <a href=\"https://doi.org/10.5072/FK2/M8KVFJ\" target=\"_blank\">https://doi.org/10.5072/FK2/M8KVFJ</a>, Root, DRAFT VERSION",
        "nameSort":"429",
        "title":"429",
        "otherIdAgency":["agency1"],
        "authorName":["Admin, Dataverse"],
        "authorName_ss":["Admin, Dataverse"],
        "affiliation_ss":["Dataverse.org"],
        "authorAffiliation":["Dataverse.org"],
        "authorAffiliation_ss":["Dataverse.org"],
        "datasetContactName":["Admin, Dataverse"],
        "datasetContactAffiliation":["Dataverse.org"],
        "dsDescriptionValue":["test"],
        "subject":["Agricultural Sciences"],
        "subject_ss":["Agricultural Sciences"],
        "depositor":"Admin, Dataverse",
        "dateOfDeposit":"2018",
        "dateOfDeposit_s":"2018",
        "parentId":"1",
        "parentName":"Root",
        "_version_":1596215048611561472}]
  },
  "highlighting":{
    "dataset_124_draft":{
      "otherIdAgency":["<em>agency1</em>"]}}}

But if I fill in "otherIdAgency" with the value "agency", no highlight appears:

curl 'http://localhost:8983/solr/collection1/select?rows=1000000&wt=json&indent=true&hl=true&hl.fl=*&q=agency'

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"agency",
      "hl":"true",
      "indent":"true",
      "hl.fl":"*",
      "rows":"1000000",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"dataset_90",
        "entityId":90,
        "dataverseVersionIndexedBy_s":"4.8.5",
        "identifier":"doi:10.5072/FK2/LM1ZLY",
        "dsPersistentId":"doi:10.5072/FK2/LM1ZLY",
        "persistentUrl":"https://doi.org/10.5072/FK2/LM1ZLY",
        "dvObjectType":"datasets",
        "dateSort":"2018-03-27T16:38:18.189Z",
        "dateFriendly":"Mar 27, 2018",
        "publicationStatus":["Published"],
        "publicationDate":"2018",
        "dsPublicationDate":"2018",
        "isHarvested":false,
        "metadataSource":"Root",
        "datasetVersionId":35,
        "citation":"Finch, Fiona, 2018, \"test search\", https://doi.org/10.5072/FK2/LM1ZLY, Root, V1",
        "citationHtml":"Finch, Fiona, 2018, \"test search\", <a href=\"https://doi.org/10.5072/FK2/LM1ZLY\" target=\"_blank\">https://doi.org/10.5072/FK2/LM1ZLY</a>, Root, V1",
        "nameSort":"test search",
        "title":"test search",
        "subject":["Medicine, Health and Life Sciences"],
        "subject_ss":["Medicine, Health and Life Sciences"],
        "subtitle":"subtitle",
        "otherIdAgency":["agency"],
        "authorName":["Finch, Fiona"],
        "authorName_ss":["Finch, Fiona"],
        "affiliation_ss":["Birds Inc."],
        "authorAffiliation":["Birds Inc."],
        "authorAffiliation_ss":["Birds Inc."],
        "datasetContactName":["Finch, Fiona"],
        "dsDescriptionValue":["Darwin's finches (also known as the Galápagos finches) are a group of about fifteen species of passerine birds."],
        "subtreePaths":["/89"],
        "parentId":"89",
        "parentName":"dvf9916eda",
        "_version_":1596214687686459392}]
  },
  "highlighting":{
    "dataset_90":{}}}
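To make the two cases above easier to compare across many values, a crude check is to count the `<em>` tags in each response. The sketch below does only that; the function name is made up, and it would overcount if a stored field happened to contain `<em>` itself.

```shell
#!/bin/sh
# Count highlighted snippets (<em> tags) in a Solr JSON response read from stdin.
count_highlights() {
  # grep -o prints each match on its own line; wc -l counts them.
  # tr strips the padding some wc implementations add.
  grep -o '<em>' | wc -l | tr -d ' '
}

# Example, with Solr running locally:
#   curl -s 'http://localhost:8983/solr/collection1/select?wt=json&hl=true&hl.fl=*&q=agency' | count_highlights
```

A count of 0 alongside a non-zero numFound corresponds to the second case above, where the document matches but nothing is highlighted.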

@pdurbin
Member

pdurbin commented Mar 29, 2018

After standup this morning @djbrooke @scolapasta @kcondon and I talked about the highlighting issues we're seeing. I just did some initial investigation that gives me confidence that highlighting works just fine in Solr 7.2.1 when you use their example config and data. I just opened #4557 to post these results and so that we have an issue to estimate in the future. We decided that the highlighting issues are not a show stopper for merging pull request #4520.

@matthew-a-dunlap
Contributor

fwiw I plan to take another look at our solr config tomorrow to see if something related to the highlighting shows itself, but I'd be surprised if I find anything.

@matthew-a-dunlap
Contributor

I tried diffing the Solr configs, but because our customizations were ported between two different versions of the stock config, the diff did not reveal anything about our minor bugs.

@pdurbin
Member

pdurbin commented Apr 2, 2018

I just merged the latest from develop into the pull request: 5f67e56
