Handle more than 50,000 entries in the sitemap #8936

Closed
PaulBoon opened this issue Aug 25, 2022 · 8 comments · Fixed by #10321
Labels
Feature: Metadata · Size: 10 (a percentage of a sprint; 7 hours) · Type: Bug (a defect) · User Role: Sysadmin (installs, upgrades, and configures the system; connects via ssh)
Comments

@PaulBoon
Contributor

PaulBoon commented Aug 25, 2022

What steps does it take to reproduce the issue?
Generate a sitemap for an archive that has more than 50k datasets
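
For reference, the sitemap itself is generated via the admin API (the same call the cron script further down uses):

curl -X POST http://localhost:8080/api/admin/sitemap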

  • What happens?
    A single sitemap.xml file is generated, but Google only accepts sitemap files with 50,000 or fewer URLs in them, so it won't be used for indexing.

  • What did you expect to happen?
    Dataverse should split the sitemap entries across several files and reference them from a sitemap index file (see the sketch below). See: https://developers.google.com/search/docs/advanced/sitemaps/large-sitemaps
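
For reference, a sitemap index in the sitemaps.org format is just a small XML file pointing at the individual sitemap files; a minimal sketch (the host and filenames are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://demo.dataverse.org/sitemap/sitemap1.xml</loc>
    <lastmod>2022-08-25</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://demo.dataverse.org/sitemap/sitemap2.xml</loc>
    <lastmod>2022-08-25</lastmod>
  </sitemap>
</sitemapindex>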

Any related open or closed issues to this bug report?

@landreev
Contributor

@PaulBoon Do you happen to know for a fact whether this is still a problem? I.e., is Google still enforcing this limit?
I was under the impression/assumption that they were no longer applying it, but I'm now seeing some evidence to the contrary, and they still appear to mention it in their documentation.
It really looks like we need to address this in the code, to be safe.

@PaulBoon
Contributor Author

@landreev This was a while back, but I do remember that the Google Search Console was driving me mad.
It might be that Google is not very strict about the limit, but the last time I looked it was complaining once you had 100k+ URLs.
We now have a Python script in place that splits up the sitemap every night via cron. However, we still have problems with Google: it is not clear what it is doing, or when and how; their indexing is intentionally a black box.
Like others in the Dataverse community, we have trouble getting all the published datasets properly indexed by Google. It might be good if we shared our combined knowledge somehow.

@pdurbin
Member

pdurbin commented Jan 5, 2024

@PaulBoon yeah. Can you please upload your script here? Maybe someone can use it for now, until we implement a proper solution in Dataverse itself.

@landreev
Contributor

landreev commented Jan 5, 2024

@PaulBoon thank you. I've been looking into all of this, and yes, it will be a good idea to combine and document all the solutions/tips we find.
BTW, did it actually work in your case to supply the sitemap index to the bot by simply adding it to your robots.txt? I did try that, via this line:

sitemap: https://dataverse.harvard.edu/sitemap_index.xml

but the bot just kept stubbornly using the combined sitemap we had there previously. I had to go into the Search Console and force-submit the index there. (Although there's a chance I simply didn't wait long enough and it would have switched to it eventually?)

There appear to be lots of small idiosyncratic things like this when trying to appease the bot.

@PaulBoon
Contributor Author

PaulBoon commented Jan 10, 2024

@pdurbin This is the script we use to split the sitemap; we scraped it from the internet some time ago.
It is templated in our ansible deployment scripts.

splitter.py below

#!/usr/bin/env python3

import os
import sys
from xml.sax import parse
from xml.sax.saxutils import XMLGenerator
import datetime

# based on code from https://github.com/realitix/sitemap_splitter
# needs to be run from the same directory as the sitemap.xml file

BASE_URL = "{{ dataverse.payara.siteurl }}/sitemap/"
BREAK_AFTER = 2500

class CycleFile():
    def __init__(self, filename):
        self.basename, self.ext = os.path.splitext(filename)
        self.index = 0
        self.filenames = []
        self.open_next_file()

    def open_next_file(self):
        self.index += 1
        filename = self.name()
        self.file = open(filename, 'w')
        self.filenames.append(filename)

    def name(self):
        return '%s%s%s' % (self.basename, self.index, self.ext)

    def cycle(self):
        self.file.close()
        self.open_next_file()

    def write(self, data):
        # XMLGenerator hands us UTF-8 encoded bytes; decode before writing
        # to the text-mode file opened in open_next_file()
        self.file.write(data.decode('utf-8'))

    def close(self):
        self.file.close()


class XMLBreaker(XMLGenerator):
    def __init__(self, break_into=None, break_after=1000, out=None, *args, **kwargs):
        XMLGenerator.__init__(self, out, encoding='utf-8', *args, **kwargs)
        self.out_file = out
        self.break_into = break_into
        self.break_after = break_after
        self.context = []
        self.count = 0

    def startElement(self, name, attrs):
        XMLGenerator.startElement(self, name, attrs)
        self.context.append((name, attrs))

    def endElement(self, name):
        XMLGenerator.endElement(self, name)
        self.context.pop()

        if name == self.break_into:
            self.count += 1
            if self.count == self.break_after:
                self.count = 0
                # We just closed the break_after-th <url>: close every element that
                # is still open, switch to the next output file, then re-open those
                # elements there so each split file is a complete, valid document
                for element in reversed(self.context):
                    self.out_file.write(b"\n")
                    XMLGenerator.endElement(self, element[0])
                self.out_file.cycle()

                XMLGenerator.startDocument(self)
                for element in self.context:
                    XMLGenerator.startElement(self, *element)


def generate_index(base_url, filenames):
    now = datetime.datetime.now()
    dt = now.strftime("%Y-%m-%d")
    index_content = """<?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    """

    for filename in filenames:
        index_content += """
            <sitemap>
                <loc>{}</loc>
                <lastmod>{}</lastmod>
            </sitemap>
        """.format(base_url+filename, dt)

    index_content += """
    </sitemapindex>
    """

    # Move current sitemap to backup and write the other one
    os.rename('sitemap.xml', 'backup_sitemap.xml')
    with open('sitemap.xml', 'w') as f:
        f.write(index_content)


def run():
    filename = "sitemap.xml"
    break_into = "url"
    break_after = BREAK_AFTER
    cycle = CycleFile(filename)
    parse(filename, XMLBreaker(break_into, break_after, out=cycle))
    cycle.close()  # flush and close the last split file before the index is written
    generate_index(BASE_URL, cycle.filenames)


if __name__ == '__main__':
    run()
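
Run from the sitemap directory, it leaves the split files plus the index behind; an illustrative session (the file count depends on BREAK_AFTER and the number of URLs):

$ cd /tmp/sitemap
$ ./splitter.py
$ ls
backup_sitemap.xml  sitemap.xml  sitemap1.xml  sitemap2.xml  sitemap3.xml

Here sitemap.xml has become the index and backup_sitemap.xml is the original single-file sitemap.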

We also have two bash scripts that are used to get this working as a cronjob.
The job will run: "/home/{{ shared_payara_user }}/bin/generate-sitemap.sh 2>&1 | /usr/bin/logger -t generate-sitemap"
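
In crontab form that looks something like this (the nightly schedule here is an assumption; adjust to taste):

# illustrative crontab entry for the payara user
15 2 * * * /home/{{ shared_payara_user }}/bin/generate-sitemap.sh 2>&1 | /usr/bin/logger -t generate-sitemap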

generate-sitemap.sh

#!/bin/bash

# Update the dataverse sitemap
# see: https://guides.dataverse.org/en/latest/installation/config.html#creating-a-sitemap-and-submitting-it-to-search-engines
# The sitemap.xml file will be generated in /var/lib/payara5/glassfish/domains/domain1/docroot/sitemap/

SITEMAP_DIR="/var/lib/payara5/glassfish/domains/domain1/docroot/sitemap"
BIN_DIR="$HOME/bin"

# Split the existing sitemap first; this is the previously generated one.
# Needed because we don't know when the new sitemap file is ready, so we lag behind by one run.
if [[ -f $SITEMAP_DIR/sitemap.xml ]]
then
  (
    $BIN_DIR/splitup-sitemap.sh
  ) || {
    exit 1
  }
fi

CURL_OUT="curloutput.txt"
(
  # Try to update the sitemap
  # Run curl, and stick all output in the temp file
  /usr/bin/curl --silent --show-error -X POST http://localhost:8080/api/admin/sitemap > "$CURL_OUT" 2>&1
) || {
  # If curl exited with a non-zero error code, send its output to stderr so that
  # cron could e-mail it.
  # You can test this by stopping the payara service for instance
  cat "$CURL_OUT" 1>&2
  rm "$CURL_OUT"
  exit 1
}

# curl completed, but the result status may still not be 'OK'
if ! grep -q "^{\"status\":\"OK\"" "$CURL_OUT"; then
  # If the response does not start with the OK status, send the output to stderr
  # so that cron e-mails it. This happens when a leftover sitemap.xml.staged file
  # is present, for instance.
  cat "$CURL_OUT" 1>&2
  rm "$CURL_OUT"
  # Remove any staged file, otherwise next update attempt will also fail
  rm -f $SITEMAP_DIR/sitemap.xml.staged
  exit 1
fi

# Everything seems OK, so send the output to stdout (which
# should be redirected to a log file in crontab)
cat "$CURL_OUT"
rm "$CURL_OUT"

and splitup-sitemap.sh

#!/bin/bash

SITEMAP_DIR="/var/lib/payara5/glassfish/domains/domain1/docroot/sitemap"
SPLIT_DIR="/tmp/sitemap"
BIN_DIR="$HOME/bin"

rm -rf $SPLIT_DIR
mkdir $SPLIT_DIR
cp $SITEMAP_DIR/sitemap.xml $SPLIT_DIR/
cp $BIN_DIR/splitter.py $SPLIT_DIR/
( 
   cd $SPLIT_DIR
   ./splitter.py
) || {
   rm -rf $SPLIT_DIR
   exit 1
}
mv $SPLIT_DIR/sitemap.xml $SPLIT_DIR/sitemap_index.xml
rm $SPLIT_DIR/backup_sitemap.xml
rm $SPLIT_DIR/splitter.py
cp $SPLIT_DIR/* $SITEMAP_DIR/
rm -rf $SPLIT_DIR

I do see payara5 hardwired in a few places, plus some more of our ansible vars, but you get the general idea.
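
If you adapt these scripts, the hardwired payara5 docroot is the first thing to parameterize; a minimal sketch, assuming a variable your deployment tooling sets:

# assumption: PAYARA_DOMAIN_DIR is provided by your environment,
# e.g. PAYARA_DOMAIN_DIR=/var/lib/payara5/glassfish/domains/domain1
SITEMAP_DIR="${PAYARA_DOMAIN_DIR}/docroot/sitemap"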

@PaulBoon
Contributor Author

Sorry, I accidentally closed the issue

@PaulBoon PaulBoon reopened this Jan 10, 2024
@pdurbin
Member

pdurbin commented Jan 10, 2024

Awesome, thanks @PaulBoon

@cmbz

cmbz commented Jan 30, 2024

2024/01/29

  • Prioritized following Slack conversation with @scolapasta

jeromeroucou added a commit to Recherche-Data-Gouv/dataverse that referenced this issue Jan 31, 2024
@landreev landreev added the Size: 10 label Feb 12, 2024
@scolapasta scolapasta assigned scolapasta and unassigned scolapasta Feb 29, 2024
pdurbin added a commit that referenced this issue Apr 24, 2024
@pdurbin pdurbin added this to the 6.3 milestone May 8, 2024