Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add DataFrame.persist, and notes on execution model (#307)
* wip: add notes on execution model
* reword
* remove column mentions for now
* remove to_array
* use persist instead
* remove note on propagation
* update purpose and scope
* reduce execution_model
* Update spec/API_specification/dataframe_api/dataframe_object.py

  Co-authored-by: Ralf Gommers <ralf.gommers@gmail.com>
* Update spec/purpose_and_scope.md

---------

Co-authored-by: Ralf Gommers <ralf.gommers@gmail.com>
1 parent e310573, commit 7be00b6
Showing 4 changed files with 87 additions and 1 deletion.
@@ -0,0 +1,49 @@
# Execution model

## Scope

The vast majority of the Dataframe API is designed to be agnostic of the
underlying execution model.

However, some methods may not be supported by every implementation,
depending on its execution model.

For example, consider the following:
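```python
# `select_features` is an illustrative wrapper so that `return` is valid.
def select_features(df: DataFrame) -> list[str]:
    features = []
    for column_name in df.column_names:
        # Keep only columns whose standard deviation is positive.
        if df.col(column_name).std() > 0:
            features.append(column_name)
    return features
```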
If `df` is a lazy dataframe, then the call `df.col(column_name).std() > 0` returns
a (ducktyped) Python boolean scalar. No issues so far. The problem is what
happens when `if df.col(column_name).std() > 0` is evaluated.

Under the hood, Python will call `(df.col(column_name).std() > 0).__bool__()` in
order to extract a Python boolean. This is a problem for "lazy" implementations,
because producing that boolean forces the deferred computation to be executed.
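
To make the mechanism concrete, here is a minimal sketch of a lazy scalar in plain Python. The `LazyScalar` class and its `compute` method are hypothetical stand-ins written for illustration; they are not part of the Standard or of any library. The sketch assumes a lazy implementation chooses to raise from `__bool__`, which is only one possible behaviour.

```python
class LazyScalar:
    """Toy stand-in for a lazy scalar (hypothetical, for illustration only)."""

    def __init__(self, thunk):
        self._thunk = thunk  # zero-argument callable that produces the value

    def __gt__(self, other):
        # Comparisons stay lazy: wrap the comparison in a new deferred value.
        return LazyScalar(lambda: self._thunk() > other)

    def __bool__(self):
        # `if scalar:` ends up here. This toy implementation refuses to
        # break laziness implicitly and raises instead.
        raise TypeError("lazy scalar has no truth value; call .compute() first")

    def compute(self):
        # Explicitly break laziness and return a plain Python value.
        return self._thunk()


std = LazyScalar(lambda: 0.5)  # pretend this is a deferred standard deviation
condition = std > 0            # still lazy, nothing has been evaluated yet

try:
    if condition:              # Python calls condition.__bool__() here
        pass
    raised = False
except TypeError:
    raised = True

materialized = condition.compute()  # explicit evaluation yields a real bool
```

In this sketch the comparison itself never fails; only the implicit conversion to `bool` does, which mirrors the distinction drawn above.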

Dask and Polars both require that `.compute` (respectively `.collect`) be called
beforehand for such an operation to be executed:
```python
In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: pandas_df = pd.DataFrame({"x": [1, 2, 3], "y": 1})

In [4]: df = dd.from_pandas(pandas_df, npartitions=2)

In [5]: scalar = df.x.std() > 0

In [6]: if scalar:
   ...:     print('scalar is positive')
   ...:
---------------------------------------------------------------------------
[...]

TypeError: Trying to convert dd.Scalar<gt-bbc3..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.
```

Whether such computation succeeds or raises is currently not defined by the Standard and may vary across
implementations.
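
As a rough illustration of how the same code can behave differently across implementations, the sketch below runs the feature-selection loop from earlier in this section against two toy dataframes. Everything here (`ToyFrame`, `EagerColumn`, `LazyColumn`, `select_features`) is hypothetical and written only for this example; real implementations may succeed, raise, or behave differently in other ways.

```python
import statistics


class EagerColumn:
    """Toy eager column: std() returns a plain float immediately."""

    def __init__(self, values):
        self._values = values

    def std(self):
        return statistics.stdev(self._values)


class _Deferred:
    """Toy deferred value that refuses implicit conversion to bool."""

    def __init__(self, thunk):
        self._thunk = thunk

    def __gt__(self, other):
        return _Deferred(lambda: self._thunk() > other)

    def __bool__(self):
        raise TypeError("deferred value cannot be used in a condition")


class LazyColumn:
    """Toy lazy column: std() returns a deferred value."""

    def __init__(self, values):
        self._values = values

    def std(self):
        return _Deferred(lambda: statistics.stdev(self._values))


class ToyFrame:
    """Toy dataframe exposing only `column_names` and `col`."""

    def __init__(self, data, column_type):
        self._data = data
        self._column_type = column_type

    @property
    def column_names(self):
        return list(self._data)

    def col(self, name):
        return self._column_type(self._data[name])


def select_features(df):
    # Same loop as in the example above, wrapped in a function.
    features = []
    for column_name in df.column_names:
        if df.col(column_name).std() > 0:
            features.append(column_name)
    return features


data = {"x": [1, 2, 3], "y": [1, 1, 1]}

eager_result = select_features(ToyFrame(data, EagerColumn))

try:
    select_features(ToyFrame(data, LazyColumn))
    lazy_raised = False
except TypeError:
    lazy_raised = True
```

With the eager toy the loop completes and keeps only the non-constant column, while with the lazy toy the `if` statement raises. Code that has to run against both kinds of implementation therefore cannot rely on the implicit conversion.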
@@ -8,3 +8,4 @@ Design topics & constraints
backwards_compatibility
data_interchange
python_builtin_types
execution_model