Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDATA deserialization #311

Closed
Mingun opened this issue Aug 24, 2021 · 6 comments
Closed

CDATA deserialization #311

Mingun opened this issue Aug 24, 2021 · 6 comments

Comments

@Mingun
Copy link
Collaborator

Mingun commented Aug 24, 2021

It seems that CDATA parsing performed incorrectly. CDATA should contain logically unescaped content (i.e. you don't need to unescape it). So the string <![CDATA[escaped&#x20;string]]> should be parsed into string escaped&#x20;string in contrast to the ordinal text content where the result should be escaped string.

According to the https://www.w3.org/TR/xml/#sec-cdata-sect CDATA sections cannot contain any escape sequences. I've checked several online validators (for example https://jsonformatter.org/xml-parser) and they agree with that.

However, quick-xml has different opinion. This test seems invalid:

#[test]
fn test_cdata_open_close() {
let mut r = Reader::from_str("<![CDATA[test <> test]]>");
r.trim_text(true);
next_eq!(r, CData, b"test &lt;&gt; test");
}

As a consequence, all CDATA processing in the Deserializer is broken because it is simply handled the same as the Text event, and in several places handling of the CDATA just absent.

I preparing a PR with these fixes, but because I'm not working with XML very often, I'm not sure that I'm not wrong. So I open this issue to get feedback.

iovxw added a commit to iovxw/rssbot that referenced this issue Sep 2, 2021
@iovxw
Copy link

iovxw commented Sep 2, 2021

From #244:

The BytesText buffer should always contain escaped string

That's good API design but bad for performance, maybe we need something like quick_xml::events::BytesTextUnescaped or quick_xml::events::BytesCdata?

@OliverBrotchie
Copy link

OliverBrotchie commented Sep 3, 2021

This is quite unintuitive behaviour. Is there a quick workaround for this?

From #244:

The BytesText buffer should always contain escaped string

That's good API design but bad for performance, maybe we need something like quick_xml::events::BytesTextUnescaped or quick_xml::events::BytesCdata?

BytesTextUnescaped defiantly seems like a good solution.

@OliverBrotchie
Copy link

OliverBrotchie commented Sep 4, 2021

Btw this works with the current API:

use quick_xml::{events::Event, Reader, Writer};
use std::{
    fs,
    io::{self, Cursor},
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = fs::read_to_string("foo.xml")?;
    let mut r = Reader::from_str(&path);
    let mut w = Writer::new(Cursor::new(Vec::new()));
    let mut buf = Vec::<u8>::new();

    // Loop over the xml tags
    loop {
        match r.read_event(&mut buf) {
        
            // Special case for CDATA
            Ok(Event::CData(e)) => {
                w.write(format!(
                    "<![CDATA[{}]]>\n",
                    str::from_utf8(&e.unescaped()?.into_owned())?
                ).as_bytes())?;
            } 
            Ok(Event::Eof) => break,
            Ok(e) => w.write_event(e)?,
            Err(e) => panic!("Error at position {}: {:?}", r.buffer_position(), e),
        }
        buf.clear();
    }
    fs::write("foo.xml", w.into_inner().into_inner())?;
    Ok(())
}

@swi2012
Copy link

swi2012 commented Nov 24, 2021

Hello.
I've got a problem with CData too, but not sure if it's quick-xml problem:
In xml i want to parse i have text inside tags as cdata, but it's look like this (escaped):
<NAME>&lt;![CDATA[GLOBAL RELIEF FOUNDATION (GRF)* ]]&gt;</NAME>
So Event::CData do not trigger, and Event::Start and read_text inside it gives me string
<![CDATA[GLOBAL RELIEF FOUNDATION (GRF)* ]]>.

I'm wonder how can i get actual text from cdata inside tag?

@Mingun
Copy link
Collaborator Author

Mingun commented Nov 25, 2021

You have a string representation of one XML inside other XML -- so you need to parse NAME content as an inner XML

@swi2012
Copy link

swi2012 commented Nov 25, 2021

You have a string representation of one XML inside other XML -- so you need to parse NAME content as an inner XML

Thanks. It's worked, but * and , at the end of text remains. Code like clean this garbage for me:

 Ok(Event::CData(t)) => {
                        s = t.unescape_and_decode(&r).unwrap();
                        if !s.is_empty() {
                            let mut ast = Vec::<usize>::new();
                            let tmp: &str = s.trim_end().trim_end_matches(|c| c == '*' || c == ',');
                            if let Some(i) = tmp.find('*') {
                                ast.push(i)
                            }
                            s = tmp.to_string();
                            if ast.len() > 0 {
                                for idx in &ast {
                                    s.remove(*idx);
                                    s.remove(*idx + 1);
                                }
                            }
                            return s;
                        }
                    }

Mingun referenced this issue in Mingun/fast-xml Mar 20, 2022
…tead of `BytesText`

This commit revert changes from 85f9f68
Mingun referenced this issue in Mingun/fast-xml Mar 21, 2022
…tead of `BytesText`

This commit revert changes from 85f9f68
Mingun referenced this issue in Mingun/fast-xml Mar 27, 2022
…tead of `BytesText`

This commit revert changes from 85f9f68
@Mingun Mingun closed this as completed in 738ddd7 May 21, 2022
crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue May 30, 2022
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [quick-xml](https://github.com/tafia/quick-xml) | dependencies | patch | `0.23.0-alpha3` -> `0.23.0` |
| [quick-xml](https://github.com/tafia/quick-xml) | dependencies | minor | `0.22.0` -> `0.23.0` |

---

### Release Notes

<details>
<summary>tafia/quick-xml</summary>

### [`v0.23.0`](https://github.com/tafia/quick-xml/blob/HEAD/Changelog.md#&#8203;0230----2022-05-08)

-   feat: add support for `i128` / `u128` in attributes or text/CDATA content
-   test: add tests for malformed inputs for serde deserializer
-   fix: allow to deserialize `unit`s from any data in attribute values and text nodes
-   refactor: unify errors when EOF encountered during serde deserialization
-   test: ensure that after deserializing all XML was consumed
-   feat: add `Deserializer::from_str`, `Deserializer::from_slice` and `Deserializer::from_reader`
-   refactor: deprecate `from_bytes` and `Deserializer::from_borrowing_reader` because
    they are fully equivalent to `from_slice` and `Deserializer::new`
-   refactor: reduce number of unnecessary copies when deserialize numbers/booleans/identifiers
    from the attribute and element names and attribute values
-   fix: allow to deserialize `unit`s from text and CDATA content.
    `DeError::InvalidUnit` variant is removed, because after fix it is no longer used
-   fix: `ElementWriter`, introduced in [#&#8203;274](tafia/quick-xml#274)
    (0.23.0-alpha2) now available to end users
-   fix: allow lowercase `<!doctype >` definition (used in HTML 5) when parse document from `&[u8]`
-   test: add tests for consistence behavior of buffered and borrowed readers
-   fix: produce consistent error positions in buffered and borrowed readers
-   feat: `Error::UnexpectedBang` now provide the byte found
-   refactor: unify code for buffered and borrowed readers
-   fix: fix internal panic message when parse malformed XML
    ([#&#8203;344](tafia/quick-xml#344))
-   test: add tests for trivial documents (empty / only comment / `<root>...</root>` -- one tag with content)
-   fix: CDATA was not handled in many cases where it should
-   fix: do not unescape CDATA content because it never escaped by design.
    CDATA event data now represented by its own `BytesCData` type
    ([quick-xml#&#8203;311](tafia/quick-xml#311))
-   feat: add `Reader::get_ref()` and `Reader::get_mut()`, rename
    `Reader::into_underlying_reader()` to `Reader::into_inner()`
-   refactor: now `Attributes::next()` returns a new type `AttrError` when attribute parsing failed
    ([#&#8203;4](Mingun/fast-xml#4))
-   test: properly test all paths of attributes parsing ([#&#8203;4](Mingun/fast-xml#4))
-   feat: attribute iterator now implements `FusedIterator` ([#&#8203;4](Mingun/fast-xml#4))
-   fix: fixed many errors in attribute parsing using iterator, returned from `attributes()`
    or `html_attributes()` ([#&#8203;4](Mingun/fast-xml#4))

</details>

---

### Configuration

📅 **Schedule**: At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about these updates again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, click this checkbox.

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1377
Reviewed-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants