CDATA deserialization #311

Mingun · 2021-08-24T19:28:17Z

It seems that CDATA parsing performed incorrectly. CDATA should contain logically unescaped content (i.e. you don't need to unescape it). So the string <![CDATA[escaped string]]> should be parsed into string escaped string in contrast to the ordinal text content where the result should be escaped string.

According to the https://www.w3.org/TR/xml/#sec-cdata-sect CDATA sections cannot contain any escape sequences. I've checked several online validators (for example https://jsonformatter.org/xml-parser) and they agree with that.

However, quick-xml has different opinion. This test seems invalid:

quick-xml/tests/unit_tests.rs

Lines 179 to 184 in 516f1ea

    
           #[test] 
        
           fn test_cdata_open_close() { 
        
               let mut r = Reader::from_str("<![CDATA[test <> test]]>"); 
        
               r.trim_text(true); 
        
               next_eq!(r, CData, b"test &lt;&gt; test"); 
        
           }

As a consequence, all CDATA processing in the Deserializer is broken because it is simply handled the same as the Text event, and in several places handling of the CDATA just absent.

I preparing a PR with these fixes, but because I'm not working with XML very often, I'm not sure that I'm not wrong. So I open this issue to get feedback.

The text was updated successfully, but these errors were encountered:

Waiting for tafia/quick-xml#311 Fixes: #230

iovxw · 2021-09-02T13:49:14Z

From #244:

The BytesText buffer should always contain escaped string

That's good API design but bad for performance, maybe we need something like quick_xml::events::BytesTextUnescaped or quick_xml::events::BytesCdata?

OliverBrotchie · 2021-09-03T12:29:03Z

This is quite unintuitive behaviour. Is there a quick workaround for this?

From #244:

The BytesText buffer should always contain escaped string

That's good API design but bad for performance, maybe we need something like quick_xml::events::BytesTextUnescaped or quick_xml::events::BytesCdata?

BytesTextUnescaped defiantly seems like a good solution.

OliverBrotchie · 2021-09-04T17:07:19Z

Btw this works with the current API:

use quick_xml::{events::Event, Reader, Writer};
use std::{
    fs,
    io::{self, Cursor},
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = fs::read_to_string("foo.xml")?;
    let mut r = Reader::from_str(&path);
    let mut w = Writer::new(Cursor::new(Vec::new()));
    let mut buf = Vec::<u8>::new();

    // Loop over the xml tags
    loop {
        match r.read_event(&mut buf) {
        
            // Special case for CDATA
            Ok(Event::CData(e)) => {
                w.write(format!(
                    "<![CDATA[{}]]>\n",
                    str::from_utf8(&e.unescaped()?.into_owned())?
                ).as_bytes())?;
            } 
            Ok(Event::Eof) => break,
            Ok(e) => w.write_event(e)?,
            Err(e) => panic!("Error at position {}: {:?}", r.buffer_position(), e),
        }
        buf.clear();
    }
    fs::write("foo.xml", w.into_inner().into_inner())?;
    Ok(())
}

swi2012 · 2021-11-24T21:08:43Z

Hello.
I've got a problem with CData too, but not sure if it's quick-xml problem:
In xml i want to parse i have text inside tags as cdata, but it's look like this (escaped):
<NAME><![CDATA[GLOBAL RELIEF FOUNDATION (GRF)* ]]></NAME>
So Event::CData do not trigger, and Event::Start and read_text inside it gives me string
<![CDATA[GLOBAL RELIEF FOUNDATION (GRF)* ]]>.

I'm wonder how can i get actual text from cdata inside tag?

Mingun · 2021-11-25T04:42:23Z

You have a string representation of one XML inside other XML -- so you need to parse NAME content as an inner XML

swi2012 · 2021-11-25T17:59:32Z

You have a string representation of one XML inside other XML -- so you need to parse NAME content as an inner XML

Thanks. It's worked, but * and , at the end of text remains. Code like clean this garbage for me:

 Ok(Event::CData(t)) => {
                        s = t.unescape_and_decode(&r).unwrap();
                        if !s.is_empty() {
                            let mut ast = Vec::<usize>::new();
                            let tmp: &str = s.trim_end().trim_end_matches(|c| c == '*' || c == ',');
                            if let Some(i) = tmp.find('*') {
                                ast.push(i)
                            }
                            s = tmp.to_string();
                            if ast.len() > 0 {
                                for idx in &ast {
                                    s.remove(*idx);
                                    s.remove(*idx + 1);
                                }
                            }
                            return s;
                        }
                    }

…tead of `BytesText` This commit revert changes from 85f9f68

This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [quick-xml](https://github.com/tafia/quick-xml) | dependencies | patch | `0.23.0-alpha3` -> `0.23.0` | | [quick-xml](https://github.com/tafia/quick-xml) | dependencies | minor | `0.22.0` -> `0.23.0` | --- ### Release Notes <details> <summary>tafia/quick-xml</summary> ### [`v0.23.0`](https://github.com/tafia/quick-xml/blob/HEAD/Changelog.md#0230----2022-05-08) - feat: add support for `i128` / `u128` in attributes or text/CDATA content - test: add tests for malformed inputs for serde deserializer - fix: allow to deserialize `unit`s from any data in attribute values and text nodes - refactor: unify errors when EOF encountered during serde deserialization - test: ensure that after deserializing all XML was consumed - feat: add `Deserializer::from_str`, `Deserializer::from_slice` and `Deserializer::from_reader` - refactor: deprecate `from_bytes` and `Deserializer::from_borrowing_reader` because they are fully equivalent to `from_slice` and `Deserializer::new` - refactor: reduce number of unnecessary copies when deserialize numbers/booleans/identifiers from the attribute and element names and attribute values - fix: allow to deserialize `unit`s from text and CDATA content. `DeError::InvalidUnit` variant is removed, because after fix it is no longer used - fix: `ElementWriter`, introduced in [#274](tafia/quick-xml#274) (0.23.0-alpha2) now available to end users - fix: allow lowercase `<!doctype >` definition (used in HTML 5) when parse document from `&[u8]` - test: add tests for consistence behavior of buffered and borrowed readers - fix: produce consistent error positions in buffered and borrowed readers - feat: `Error::UnexpectedBang` now provide the byte found - refactor: unify code for buffered and borrowed readers - fix: fix internal panic message when parse malformed XML ([#344](tafia/quick-xml#344)) - test: add tests for trivial documents (empty / only comment / `<root>...</root>` -- one tag with content) - fix: CDATA was not handled in many cases where it should - fix: do not unescape CDATA content because it never escaped by design. CDATA event data now represented by its own `BytesCData` type ([quick-xml#311](tafia/quick-xml#311)) - feat: add `Reader::get_ref()` and `Reader::get_mut()`, rename `Reader::into_underlying_reader()` to `Reader::into_inner()` - refactor: now `Attributes::next()` returns a new type `AttrError` when attribute parsing failed ([#4](Mingun/fast-xml#4)) - test: properly test all paths of attributes parsing ([#4](Mingun/fast-xml#4)) - feat: attribute iterator now implements `FusedIterator` ([#4](Mingun/fast-xml#4)) - fix: fixed many errors in attribute parsing using iterator, returned from `attributes()` or `html_attributes()` ([#4](Mingun/fast-xml#4)) </details> --- ### Configuration 📅 **Schedule**: At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about these updates again. --- - [ ] If you want to rebase/retry this PR, click this checkbox. --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). Co-authored-by: cabr2-bot <cabr2.help@gmail.com> Co-authored-by: crapStone <crapstone01@gmail.com> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1377 Reviewed-by: crapStone <crapstone@noreply.codeberg.org> Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org> Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>

iovxw added a commit to iovxw/rssbot that referenced this issue Sep 2, 2021

Temporarily downgrade quick-xml to 0.20

b0f09c9

Waiting for tafia/quick-xml#311 Fixes: #230

OliverBrotchie mentioned this issue Sep 3, 2021

XML escape is applied to CDATA OliverBrotchie/ob#1

Closed

Mingun referenced this issue in Mingun/fast-xml Mar 20, 2022

Fix #311: Introduce BytesCData type for CData events and use it ins…

9433830

…tead of `BytesText` This commit revert changes from 85f9f68

Mingun referenced this issue in Mingun/fast-xml Mar 21, 2022

Fix #311: Introduce BytesCData type for CData events and use it ins…

5635a07

…tead of `BytesText` This commit revert changes from 85f9f68

Mingun mentioned this issue Mar 21, 2022

Fix CDATA handling #374

Closed

Mingun referenced this issue in Mingun/fast-xml Mar 27, 2022

Fix #311: Introduce BytesCData type for CData events and use it ins…

572b2c2

…tead of `BytesText` This commit revert changes from 85f9f68

Mingun closed this as completed in 738ddd7 May 21, 2022

plugwash mentioned this issue Dec 22, 2022

patches to support newer quick-xml rust-x-bindings/rust-xcb#216

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CDATA deserialization #311

CDATA deserialization #311

Mingun commented Aug 24, 2021

iovxw commented Sep 2, 2021

OliverBrotchie commented Sep 3, 2021 •

edited

Loading

OliverBrotchie commented Sep 4, 2021 •

edited

Loading

swi2012 commented Nov 24, 2021 •

edited

Loading

Mingun commented Nov 25, 2021

swi2012 commented Nov 25, 2021

CDATA deserialization #311

CDATA deserialization #311

Comments

Mingun commented Aug 24, 2021

iovxw commented Sep 2, 2021

OliverBrotchie commented Sep 3, 2021 • edited Loading

OliverBrotchie commented Sep 4, 2021 • edited Loading

swi2012 commented Nov 24, 2021 • edited Loading

Mingun commented Nov 25, 2021

swi2012 commented Nov 25, 2021

OliverBrotchie commented Sep 3, 2021 •

edited

Loading

OliverBrotchie commented Sep 4, 2021 •

edited

Loading

swi2012 commented Nov 24, 2021 •

edited

Loading