Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: 📝 Add expected answers to DataFrame method examples #12564

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Eason0729
Copy link

Which issue does this PR close?

Closes #12527.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Sep 21, 2024
@@ -225,7 +225,12 @@ impl DataFrame {
/// # async fn main() -> Result<()> {
/// let ctx = SessionContext::new();
/// let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
/// let df = df.select_columns(&["a", "b"])?;
/// df.select_columns(&["a", "b"])?.show().await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think about using assert_batches_eq instead so that the output is automatically checked as part of the doc tests rather than relying on us manually keeping it up to date?

It is of similar readability I think. Here is an example:

let results = ctx
.sql("SELECT * FROM test WHERE c1 = $1")
.await?
.with_param_values(vec![ScalarValue::from(3i32)])?
.collect()
.await?;
let expected = vec![
"+----+----+-------+",
"| c1 | c2 | c3 |",
"+----+----+-------+",
"| 3 | 1 | false |",
"| 3 | 10 | true |",
"| 3 | 2 | true |",
"| 3 | 3 | false |",
"| 3 | 4 | true |",
"| 3 | 5 | false |",
"| 3 | 6 | true |",
"| 3 | 7 | false |",
"| 3 | 8 | true |",
"| 3 | 9 | false |",
"+----+----+-------+",
];
assert_batches_sorted_eq!(expected, &results);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original thought is to simplify large long column on doctest(less readable), like this: https://github.com/Eason0729/datafusion/blob/4dd44c5a2b0d9810d7e9163689afab227c58d542/datafusion/core/src/dataframe/mod.rs#L734

Therefore, I simplify example_long.csv in the next commit to the point which is just enough to showcase most method on dataframe.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the expect behavior on describe method is more complex, which might require dedicated csv, so I decided to leave as it is.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @Eason0729 for this contribution 🙏

I have started the CI to check this code

I think the PR would be even better if the output was verified (so it is guaranteed to stay in sync). I left a suggestion on how to do this. Let me know what you think

@@ -184,7 +184,7 @@ impl DataFrame {
}

/// Creates logical expression from a SQL query text.
/// The expression is created and processed againt the current schema.
/// The expression is created and processed against the current schema.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

/// # #[tokio::main]
/// # async fn main() -> Result<()> {
/// let ctx = SessionContext::new();
/// let df = ctx.read_json("tests/data/unnest.json", NdJsonReadOptions::default()).await?;
/// // expend into multiple columns if it's json array, flatten field name if it's nested structure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// // expend into multiple columns if it's json array, flatten field name if it's nested structure
/// // expand into multiple columns if it's json array, flatten field name if it's nested structure

@comphead
Copy link
Contributor

Thanks @Eason0729 for your contribution.
I'm feeling we need to have something to create DF from rows rather in addition to creating DF from data files.
Example can be

let schema = 
DataFrame::from(schema, data)

Underneath the method can call ctx.read_batch(record_batch). The batch can be created with RecordBatch::try_from_iter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add expected answers to DataFrame method examples
3 participants