Path split at last point instead of first point (Suggested Improvement) #177

Closed
Chrissi2802 opened this issue Jun 9, 2024 · 1 comment

@Chrissi2802

Description

The current code in marker/output.py splits the file name at the first period, which causes problems with scientific papers. Their file names often have the form '1706.03762.pdf', so the folder name becomes '1706' instead of '1706.03762'. Different papers whose names share the same prefix then map to the same folder and can overwrite each other.

Example to illustrate

  • File name: '1706.03762.pdf'
  • Current folder name: '1706'
  • Desired folder name: '1706.03762'

Splitting at the last period instead of the first yields the desired folder name, so no data is lost.
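
The collision can be reproduced with a minimal sketch. The two helper functions and the second file name are hypothetical and only serve as an illustration; they are not part of marker.

```python
# Hypothetical helpers that mirror the two ways of deriving a folder name.
def folder_first_dot(fname: str) -> str:
    # Cuts at the FIRST period (the behavior described in this issue).
    return fname.split('.')[0]


def folder_last_dot(fname: str) -> str:
    # Cuts at the LAST period, so only the extension is removed.
    return fname.rsplit('.', 1)[0]


# Two arXiv-style file names that share the prefix '1706' (assumed inputs).
papers = ['1706.03762.pdf', '1706.03741.pdf']

first = {folder_first_dot(p) for p in papers}
last = {folder_last_dot(p) for p in papers}

print(sorted(first))  # ['1706'] -> both papers land in one folder
print(sorted(last))   # ['1706.03741', '1706.03762'] -> one folder per paper
```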

Advantage

Stripping only the final .pdf extension, rather than cutting the file name at the first period, ensures that scientific papers are saved in the correct folders and that no data is overwritten.

Important note

This is especially important for scientific papers such as those found on arxiv.org.

Current code (output.py line 6):

subfolder_name = fname.split('.')[0]

Suggested new code:

subfolder_name = fname.rsplit('.', 1)[0]

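
As a side note, the standard library's os.path.splitext also splits at the last period and may be worth considering. The sketch below compares it with the suggested rsplit call. The function names are hypothetical, and the extra inputs (a name without an extension and a dot-file) are assumed edge cases that the issue itself does not discuss.

```python
import os


def subfolder_rsplit(fname: str) -> str:
    # The suggested fix: split once, starting from the right.
    return fname.rsplit('.', 1)[0]


def subfolder_splitext(fname: str) -> str:
    # Standard-library alternative: drops only the final extension.
    return os.path.splitext(fname)[0]


for name in ['1706.03762.pdf', 'paper.pdf', 'notes', '.hidden']:
    print(repr(name), repr(subfolder_rsplit(name)), repr(subfolder_splitext(name)))
```

Both functions agree on ordinary names such as '1706.03762.pdf'. They differ only for names that begin with a period: rsplit returns an empty string for '.hidden', while os.path.splitext treats the leading period as part of the name and returns '.hidden'.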
aniketinamdar added a commit to aniketinamdar/marker that referenced this issue Jun 10, 2024
VikParuchuri added a commit that referenced this issue Jun 17, 2024
Path Split at last point instead of first. Issue #177
@VikParuchuri
Owner

Merged this fix, thanks for raising
