unexpected token <, "<html <h..." is not valid JSON error message when translating long text #65

Closed
ksdavidc opened this issue Jan 27, 2024 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@ksdavidc

If I try to translate a long text, such as the Japanese Lorem Ipsum given below, I get a message like: unexpected token <, "<html <h..." is not valid JSON

I tried cutting this down by removing one paragraph at a time from the bottom, and eventually got the message. But, interestingly, if I tried from the top (i.e. including the paragraphs previously causing the error), the error did NOT appear, and it worked. This makes me think it is the length, not the content, of the text that is causing the error.


Below is Japanese Lorem Ipsum (not the text I originally used, but this also throws the error):

延質イつ刊帯じにえラ性号あ種質ッくちご読示ヱメマ況営レキカシ属断チミラヲ候員つひおわ著捨ゆだに更竹野いぱ橋2東イ覧方ぞげと雨闘イ不仮稼戦みト。3旅ソ味一街ルキサコ拡続ヘケ持百ラやぜ暮豊げせろ図本ヤ米界わむてえ渡会タヲネ健変情59謙カフヱツ樹観ちみ憧檀タ旬速画意とへぽ関20荒衝貨例きとろ。

注次ムユタ謙酒トニ転3科日じほずひ日年裁けおレ書聞コ上残ぱクぐせ量芸ッすつふ橋米ロナスム暮委わ最算レ目算サ東比ラハステ月界イじらぼ辺理でつばべ町怠ケロ熊幕彰ずか。70航るゆん下6生ゆゃ転芸タク暗自競落ワ優判ぼざ満枠づぞ国員細しの曜取サオ要足ハ情熊よぎえ被読オヒ後合ミヲオシ遊4著ハチソ校村ホアキレ長告亮凱孫フ。

代ほラを究34親革6本東リ意決目れ護在けばイリ男択ネア貨品び掲見あゅ瞬県童ぐきやフ議記リみ長現リルエテ頭素ぜめ。4全モヨユ鯨川フ変真ゆフレ毛八ツ客伸さや堀読あろイぱ両文カ次賛オ独自んそちー護私じぞし試実アヌ番対ウチ刊逮ニ内運へ割認ンちょ薬間放跡細彼精ぴろゃ。運ヘ何地タラヘル合奉モ量載1作ーどく歳94広スオハ入圧ろゃんめ明販昔庫高キオム意子レぜりぶ都隆楽ぐゃみ。

首ワスナ属図退ヒ真換活ヌ券支メトハカ第観ょぜ邸3岡めまもび川半活ぼえぱ可高ウセコ韓胎た声一けぱり土散観け題原びざ週埼街ゅ。監ヲク約54培入レ子士ぼ急明へあど帰多びと急造ム繰見道れべ面知サキソツ回議観レ行28波ほぞねあ反合朝管に。功タレヌカ向係裁ヒエホス近輔チサヌロ門表スケ愛属をら試帯イユヘ戦磐ぴ応質監ろゆ道2海ぐらクが人俸愚わゅラ。

繰サクトニ田応うぱ明教フユ大選年ノコサヒ斉改他祉力ア隊報シノモウ歴王おッね便会吉海撤ほやゆ。総そさ断真建よ示詳リマヱ立産モ君院ぜこい権都ホシアヲ第十ぶレの市率うスドす歩提づだ権唱どさ相門典囲ぽのでお。4医ニ標過か係満サ校生レセチオ載遺け示応被てげ沢街トキ更毎らは国五ょぜづ都演彦ヒヨ住徳個京ぼ。

効交モ田早フカリ的隊作キクリフ次68新カモチ知務ヌユイ省日みそ果送ネトム数6後ヱヤチ新備にそ将脚めをや心題エ体竹ふんす由9広ろきル分県ル特禁ホセケ愛坊斉おぐ。識どん感損ざ和象幡をげやぽ報随の節夕でばむい束待産ロイキ文暑ぐうは界再ー解逃ね条画そびをぞ古総ヨヱ掲治1府ハ射論井計重イレほ。

能ぴ日万う策不ユセヨマ賞不レ索営セ権入ッわぎ伴生ゆま免意テ場提歩ナソテ等自コナウロ設政カルワ投筋スイケ語9率ぴぼた写狙物ぶ鴨億さべ購図のみま業73覧懲せひド。切ルモツメ俣動べ図曜応セマソ備止ゃちひラ護随レキニ貸制づひわ護久ヘチウ講38写二ルろそど気車わかけぎ助値厚清かぼをら。経イフ家記支わよびな夏74開化授気0川きう読金ヌ長集料ねざきぴ由秋事たせり作争きさスめ元不真ルコワモ供如ぼなれべ。

衛ぼに禁子わべでず済報ソヱチ食高やにラゅ極図れすリ意方ぽぴなラ関線そなぴつ態仕屋わは通由むげぼる芸翻激七俊幌とげ。展ネホ全帳総どし経十っに主治94寧康ーくごが線紙2済せず治地ネキ路皇メユヨ杯漏スヒ位伸お実紙ク敏当メ体索ヨミタク都漏ら。録でおざ民食はぼ途以べぐへぴ議映フ全時キチリテ員直ノヱニ図割ワウロサ聞税ねばかイ幅見ラトす属更写カ来67場シウヌメ報村験ヤスレノ派削極ラくべど。

95一ミマ美革ぐ融性ろざほ真判23掲審べ質調レべち栄区アハ企事クツチ住評レほぎ郡国のへじり子丘何ムヒシ載意トヒ失物きざむわ場52算買非妻だ。定むくリ東水生アコテ一69哲経レヱム光月もず質割マカ独性ごー再都クフ中型クセ位峨こ芸包暮下タ分副レだゃフ。苦五ヘ統面カケオ配契だぶに辞徴ずれ写誓問シケ必兵ドだ自賛イ日表と台省ぞけ元家ちべやへ面1男ク通化召喝びほ。

町ソ車必ロ賄告一なきごり績骨イメム記度ぽ束64関早ばごえず囲吉ねりルレ乗武僚とびて和愛経募しや。感ヨ横階うスル制碁べ薬住ヤ持帯イオ行崎うきとる応質むだの域彼サリ進光エヲヒヨ地変ユカスミ記融コヲセハ条込勢45売機3博ネチ事家僚きしてる。72売機7残カ示著芸見ふひまぐ行女ぱーのょ中体にがせわ自予オ提務るぐ芸研あイ関訪ハヱシ回野かがむ広経ざラた賠有将しルべ続兼唱堅姫こるく。
Fevol added the bug label Jan 27, 2024
Fevol self-assigned this Jan 27, 2024
@Fevol
Owner

Fevol commented Jan 27, 2024

That's interesting behaviour, could you share which service is giving this issue? I've tested this out with a couple (DeepL, Google Translate, Azure) and was not able to reproduce this issue with the lorem ipsum.

@ksdavidc
Author

No worries, take your time. It's by no means mission-critical. I am pretty sure it is a problem with the size of the text being translated, so I can just use shorter texts.

I forgot to mention that I am getting the message with both Lingva (using lingva.ml and lingva.lunar.icu) and LibreTranslate (using http://translate.terraprint.co). I have no need for other services, so I haven't tested them...

Having there be a limit is not so much of a problem, but the cryptic error message might be nice to fix...
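
For context, the original message is what a JSON parser reports when it is handed an HTML page (typically a server error page) instead of JSON. A minimal sketch of a clearer check is below. `parseJsonBody` is a hypothetical helper, not part of the plugin, and the HTTP status used in the usage note is made up for illustration.

```typescript
// Hypothetical helper (not the plugin's code): parse a response body as JSON,
// or throw an error that describes what the service actually sent back.
function parseJsonBody(
  status: number,
  contentType: string | null,
  body: string
): unknown {
  const looksLikeJson =
    contentType !== null && contentType.toLowerCase().includes("json");

  if (status < 200 || status >= 300 || !looksLikeJson) {
    // Show a short, single-line preview of the body instead of a parser error.
    const preview = body.slice(0, 60).replace(/\s+/g, " ");
    throw new Error(
      `Translation service returned HTTP ${status} ` +
        `(${contentType ?? "no content type"}) instead of JSON: "${preview}"`
    );
  }

  try {
    return JSON.parse(body);
  } catch {
    throw new Error(
      `Translation service returned malformed JSON (HTTP ${status})`
    );
  }
}
```

With something like this, calling `parseJsonBody(414, "text/html", "<html><head>...")` would throw a message that names the status code and shows the start of the HTML, rather than the cryptic "is not valid JSON" text.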

Fevol closed this as completed in 1e5048c Feb 8, 2024
@Fevol
Owner

Fevol commented Feb 9, 2024

Thank you for reporting the bug! This was basically caused by two different mistakes:

  1. Lingva not having a set maximum byte length for the input text (curiously, I can't find the precise maximum; the character limit doesn't seem to correlate with the number of bytes the text actually takes up)
  2. The splitting algorithm I used didn't account for multi-byte characters (such as the Japanese characters). This is fixed now.
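
To illustrate why character counts and byte counts diverge for Japanese text, here is a small sketch. The helper names are hypothetical. The URL-encoded length is included only on the assumption that a service receives the text inside a URL, which this thread does not confirm.

```typescript
const encoder = new TextEncoder();

// Number of bytes the text occupies when encoded as UTF-8.
function utf8ByteLength(text: string): number {
  return encoder.encode(text).length;
}

// Length of the text after percent-encoding. Only relevant if the text is
// sent as part of a URL (an assumption made for this example).
function urlEncodedLength(text: string): number {
  return encodeURIComponent(text).length;
}

const latin = "abc";
const japanese = "延質イ";

console.log(latin.length, utf8ByteLength(latin), urlEncodedLength(latin)); // 3 3 3
console.log(japanese.length, utf8ByteLength(japanese), urlEncodedLength(japanese)); // 3 9 27
```

Both strings have three characters, but the Japanese one takes three times as many bytes in UTF-8, so a splitter that counts characters can produce chunks well over a byte limit.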

With the stricter splitting, you may notice that the output of long texts isn't as clean (the splitting currently happens at the first available whitespace).
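
As a rough idea of what byte-aware splitting can look like, here is a minimal sketch. It is not the plugin's actual algorithm: `splitByByteLimit` is a hypothetical function, and it cuts at the last whitespace that fits within the limit, falling back to a character boundary when a chunk contains no whitespace (as is common in Japanese text).

```typescript
const utf8 = new TextEncoder();

// Split `text` into chunks of at most `maxBytes` UTF-8 bytes each, without
// ever cutting through a multi-byte character.
function splitByByteLimit(text: string, maxBytes: number): string[] {
  if (maxBytes < 4) {
    // A single Unicode code point can need up to 4 bytes in UTF-8.
    throw new RangeError("maxBytes must be at least 4");
  }

  const chunks: string[] = [];
  let current = "";
  let currentBytes = 0;
  let lastSpace = -1; // index in `current` just after the latest whitespace

  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of text) {
    const chBytes = utf8.encode(ch).length;

    while (currentBytes + chBytes > maxBytes) {
      // Prefer cutting after the last whitespace; otherwise cut right here.
      const cut = lastSpace > 0 ? lastSpace : current.length;
      chunks.push(current.slice(0, cut));
      current = current.slice(cut);
      currentBytes = utf8.encode(current).length;
      lastSpace = -1;
    }

    current += ch;
    currentBytes += chBytes;
    if (/\s/.test(ch)) {
      lastSpace = current.length;
    }
  }

  if (current.length > 0) {
    chunks.push(current);
  }
  return chunks;
}
```

Joining the returned chunks gives back the original text, so nothing is lost. The trade-off is that text without whitespace gets cut at arbitrary character boundaries, which can make the translated output less clean.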

EDIT: Release with fix coming soon!

@ksdavidc
Author

ksdavidc commented Feb 9, 2024

Thanks! Seems to be working fine now! Great job sussing it out...

@ksdavidc
Author

ksdavidc commented Feb 9, 2024

Working like a charm. THX!!
