This repository has been archived by the owner on Jul 5, 2021. It is now read-only.

fix: add robots.txt to exclude gateway paths #330

Closed
lidel wants to merge 1 commit

Conversation

lidel
Contributor

@lidel lidel commented Aug 28, 2019

This PR adds /robots.txt and closes #328

cc @andrew

Context: #328

License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>
Contributor

@andrew andrew left a comment

👍

Contributor

@cwaring cwaring left a comment

Good call, I only see positives excluding these paths from the machines 🤖

@lidel lidel requested a review from olizilla August 28, 2019 11:31
@lidel
Contributor Author

lidel commented Aug 28, 2019

cc @olizilla for visibility, as it impacts gateway

@cwaring
Contributor

cwaring commented Aug 28, 2019

Related: you may find it interesting to review what traffic this will exclude from search indexes, here is an export of the top inbound links/queries over the past 16 months. A total of 31.3m clicks and 2.07B impressions 😮

@lidel
Contributor Author

lidel commented Aug 28, 2019

@cwaring Was there a significant drop in traffic coming from Google search results in the past few weeks or months?

I think they tweaked their crawler, and the Wikipedia mirror from ipfs.io does not show in search results anymore for me. Still, merging robots.txt remains a good idea, as it would remove bogus load from the gateway when a crawler traverses the entire Wikipedia mirror etc.

@cwaring
Contributor

cwaring commented Aug 28, 2019

@lidel this is traffic over the last 3 months so nothing substantial. Since activating GSC I'm also seeing a few takedown notices for copyrighted content under /ipfs/, which could impact this root ipfs.io domain authority. Possibly something else to consider.

[Screenshot: CleanShot 2019-08-28 at 15 20 21]

@cwaring
Contributor

cwaring commented Aug 28, 2019

ref: [Screenshot: CleanShot 2019-08-28 at 15 28 40]

@olizilla
Collaborator

I'm not clear on why we want to prevent content on IPFS getting indexed?

@lidel
Contributor Author

lidel commented Aug 28, 2019

@olizilla afaik everything starting with ipfs.io/ipfs/ already got removed from Google results, as it was polluting search results with the Wikipedia mirror etc.
If we are no longer listed in search results, an automated crawler introduces unnecessary load and pollutes the cache without obvious benefits (or am I missing some?)

@olizilla
Collaborator

I think as a general position we do want IPFS content to be indexed on search engines. The bots won't be guessing CIDs, so it's content that has been linked to from somewhere. Yes, in the case of Wikipedia, a bot will eventually follow all the links, but in general if some site links to some content on IPFS, it should be indexed like any other.

@jessicaschilling added the dif/medium (Prior experience is likely helpful) and topic/docs (Documentation) labels Aug 28, 2019
@lidel
Contributor Author

lidel commented Aug 28, 2019

What if we only exclude the main offender?

I believe the main problem is the CID root of Wikipedia that was published without a proper meta tag: https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/*

Context on meta tag: ipfs/distributed-wikipedia-mirror#48
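
For illustration, the difference between excluding all gateway paths and excluding only the Wikipedia root can be checked with Python's standard `urllib.robotparser`. This is a minimal sketch, not the file from this PR: both robots.txt bodies, the `allowed` helper, and the `bafyexamplecid` placeholder are assumptions made for the example. Python's parser treats each `Disallow` value as a path prefix, so the trailing `/wiki/*` from the URL above is not needed here.

```python
from urllib.robotparser import RobotFileParser

# Wikipedia snapshot root named in this thread.
WIKIPEDIA_ROOT = "/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/"

# Hypothetical robots.txt bodies (assumptions, not the contents of this PR).
BLANKET = """\
User-agent: *
Disallow: /ipfs/
Disallow: /ipns/
"""

TARGETED = f"""\
User-agent: *
Disallow: {WIKIPEDIA_ROOT}
"""


def allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


wiki_page = f"https://ipfs.io{WIKIPEDIA_ROOT}wiki/Example.html"
other_page = "https://ipfs.io/ipfs/bafyexamplecid/index.html"  # placeholder CID

for name, body in (("blanket", BLANKET), ("targeted", TARGETED)):
    print(name, "wiki:", allowed(body, wiki_page), "other:", allowed(body, other_page))
```

Under these assumptions the targeted variant keeps other `/ipfs/` content crawlable while still blocking the snapshot, which is the trade-off being discussed.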

@autonome
Contributor

What is the problem with search engines indexing everything on all gateways? I thought we would want content to be indexed and findable on gateways as an intermediary step towards native protocol loading.

@cwaring
Contributor

cwaring commented Aug 29, 2019

This has highlighted a few issues for me, some thoughts:

  1. Indexing sets a precedent that this is the permanent access point for this resource. I feel that is the wrong message (and it makes IPFS feel broken when it cannot resolve a result).
  2. Crawling seems to be causing substantial load on our infrastructure, degrading the experience for everyone (at this point in time).
  3. Using a subpath as our primary gateway instead of a different domain or subdomain.ipfs.io could potentially devalue the ipfs.io domain authority due to multiple flags for copyright violations. Disabling indexing would help; however, it might be a better idea to migrate to a new domain (I don't have all the information behind the current configuration to understand the implementation decisions here).

Keen to hear your ideas!

@lidel
Contributor Author

lidel commented Aug 29, 2019

I agree with the concerns raised and think we need more data to make this decision.
Closing this PR to ensure we don't merge by accident.


Ad 1. Gateways are the best we can do in browsers right now. UX around ipfs.io is pretty good and enables a migration path in the future: if a browser lands native support, it will ignore the gateway host and use content paths, reviving dead links (as long as there are peers hosting the data).

Ad 2. Are we tracking the user-agent header on our gateways?
It should tell us the impact of automated crawlers (here is a list of Google ones), and maybe show the overlap between URLs accessed by crawlers and humans.

If it is not meaningful, we should probably open a PR to exclude only the immutable Wikipedia snapshot /ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco
I opened a PR to fix the "wikipedia seo issue" in upstream tooling (openzim/mwoffliner#963), but old snapshots would remain broken.
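
The user-agent check from Ad 2 could be sketched roughly as below. This is only an illustration: the `(user_agent, path)` record shape, the helper names, and the sample rows are assumptions, and the token list is a small, incomplete set of substrings that some well-known crawlers include in their User-Agent header. Real gateway logs would need their own parsing.

```python
from typing import Iterable, Set, Tuple

# Illustrative, incomplete list of crawler User-Agent substrings (lowercase).
CRAWLER_TOKENS = ("googlebot", "bingbot", "duckduckbot", "yandexbot", "baiduspider")


def is_crawler(user_agent: str) -> bool:
    """Naive check: does the User-Agent contain a known crawler token?"""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def crawler_share(requests: Iterable[Tuple[str, str]]) -> Tuple[float, Set[str]]:
    """Given (user_agent, path) records, return the fraction of requests made
    by crawlers and the set of paths requested by both crawlers and humans."""
    crawler_hits = 0
    total = 0
    crawler_paths: Set[str] = set()
    human_paths: Set[str] = set()
    for user_agent, path in requests:
        total += 1
        if is_crawler(user_agent):
            crawler_hits += 1
            crawler_paths.add(path)
        else:
            human_paths.add(path)
    share = crawler_hits / total if total else 0.0
    return share, crawler_paths & human_paths


# Made-up sample records, only to show the call shape.
sample = [
    ("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", "/ipfs/QmA/wiki/X"),
    ("Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0", "/ipfs/QmA/wiki/X"),
    ("Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0", "/ipfs/QmB/"),
    ("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)", "/ipfs/QmC/"),
]
share, overlap = crawler_share(sample)
print(share, overlap)
```

Matching on substrings is easy to spoof and misses unlisted bots, so the result is at best a lower bound on crawler traffic.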

Ad 3. We've been talking about a move to dweb.link, but it is a subdomain gateway, meaning https://{cid}.dweb.link instead of https://ipfs.io/ipfs/{cid}, and that makes it non-trivial for /ipns/ paths (IPNS and DNSLink, ipfs/kubo#5287).

At this point I worry it may be too late to improve the SEO of ipfs.io, as /ipfs/* results are already gone from Google, unless we reach out to search engines and make our case. Perhaps we should keep ipfs.io for path-based gateway (https://ipfs.io/ipfs/{cid}), use dweb.link for subdomain-based gateway and move project website from ipfs.io to something else? (the history behind .io is pretty grim)
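
To make the two URL shapes concrete, here is a rough sketch of rewriting a path-based URL into the subdomain form exactly as written above. The function name, the `bafyexamplecid` placeholder, and the lowercase check are assumptions: hostnames are case-insensitive, so the sketch simply refuses mixed-case CIDs such as the `Qm...` one rather than attempting a re-encoding, and it does not handle /ipns/ at all.

```python
from urllib.parse import urlsplit, urlunsplit


def path_to_subdomain(url: str, gateway_host: str = "dweb.link") -> str:
    """Rewrite https://<host>/ipfs/{cid}/<rest> to https://{cid}.<gateway_host>/<rest>.

    Hypothetical helper; follows the {cid}.dweb.link pattern quoted in this thread.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")  # e.g. ["", "ipfs", "<cid>", "wiki", "X"]
    if len(segments) < 3 or segments[1] != "ipfs" or not segments[2]:
        raise ValueError("expected a /ipfs/{cid} path")
    cid = segments[2]
    # Assumption: only accept CIDs that survive lowercasing, since DNS names
    # are case-insensitive and a mixed-case CID would not round-trip.
    if cid != cid.lower():
        raise ValueError("CID must use a case-insensitive encoding for use in a hostname")
    rest = "/" + "/".join(segments[3:])
    return urlunsplit((parts.scheme, f"{cid}.{gateway_host}", rest, parts.query, parts.fragment))


# Placeholder CID, not a real one.
print(path_to_subdomain("https://ipfs.io/ipfs/bafyexamplecid/wiki/index.html"))
```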

@lidel lidel closed this Aug 29, 2019
@hsanjuan
Member

hsanjuan commented Aug 29, 2019

> use dweb.link for subdomain-based gateway and move project website from ipfs.io to something else? (the history behind .io is pretty grim)

Based on some discussion in gateway team, this seems to be a way forward that has been brought up, as keeping ipfs.io as gateway is important to avoid fragmentation.

> What if we only exclude the main offender?
>
> I believe the main problem is the CID root of Wikipedia that was published without a proper meta tag: https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/*
>
> Context on meta tag: ipfs/distributed-wikipedia-mirror#48

I would suggest doing this, also because every time you search for "something ipfs" you get Wikipedia results for that something, hosted by IPFS, rather than what you are looking for.

> Crawling seems to be causing substantial load on our infrastructure degrading the experience for everyone (at this point in time).

I don't think crawlers are behind much of the pain, but we could check: @lanzafame am I right that crawlers weren't anywhere close to the top offenders?

@lidel
Contributor Author

lidel commented Sep 2, 2019

Continued in #334

Labels
dif/medium Prior experience is likely helpful topic/docs Documentation
Development

Successfully merging this pull request may close these issues.

Hotfix: Missing robots.txt
7 participants