
[DISCUSSION] libcudf column abstraction redesign #1443

Closed
jrhemstad opened this issue Apr 17, 2019 · 55 comments · Fixed by #2207
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. proposal Change current process or code

Comments

@jrhemstad
Contributor

jrhemstad commented Apr 17, 2019

Creating and interfacing with the gdf_column C struct is rife with problems. We need better abstraction(s) to make libcudf a safer and more pleasant library to use and develop.

Rather than adding support for any new column types, the initial goal of a cudf::column design is to ease the current pain points in creating and interfacing with the column data structure in libcudf.

Goals:

  • Identify pain points with existing gdf_column structure
  • Derive requirements for an abstraction or set of abstractions to ease those pain points
  • Define an API design that satisfies the requirements
  • Provide a working implementation of the design

Non-Goals

  • Derive requirements to support new column types, e.g., variable width elements, compressed columns, etc.
  • Support delayed materialization or lazy evaluation

Note that a “Non-Goal” is not something that we want to expressly forbid in our redesign; rather, it is simply not the focus of the current effort. Whenever possible, we can make design decisions that will enable these “Non-Goals” sometime in the future, so long as those decisions do not compromise the timely accomplishment of the above “Goals”.

Process

1. Gather pain points

  • Those who wish to participate should list 3-5 pain points (in priority order) that they would like to solve with the column redesign.
    • Note that choosing to participate implies a commitment to putting in the effort to derive requirements and provide feedback on designs, i.e., if you want something to change, you’re expected to put in the work to make it happen.
  • Pain points should be submitted by responding to this issue.
  • @jrhemstad will take responsibility for gathering pain points and distilling/organizing based on functional area.
  • Proposed Deadline: 0.7 release

2. Derive requirements

  • Distill pain points into satisfiable requirements
  • @jrhemstad will take responsibility for providing an initial draft of requirements from pain points and distributing for feedback.
  • Stakeholders will provide feedback on requirements and iterate until consensus is reached on initial requirements
  • Proposed Deadline: 0.8 Release

3. Design Mock Up

  • Create draft interface of class(es) that attempt to satisfy requirements.
  • APIs should be fully Doxymented.
  • Code does not need to function nor compile
  • Design should be submitted via a PR to cuDF
  • TBD will take responsibility for providing an initial interface design
  • Stakeholders will provide feedback and iterate until consensus is reached on design
  • Proposed Deadline: 0.8 Release

4. Implementation

  • Implement the agreed upon interface
  • Should provide Google Test unit tests
  • Implementation/testing will likely expose necessary design changes
  • Implementation should be submitted as a PR to cuDF
  • TBD will take responsibility for implementing/testing the design
  • Stakeholders will review implementation PR until consensus is reached
  • Proposed Deadline 0.8 Release

5. Initial Refactor

  • Two candidate libcudf features shall be chosen for refactoring to use the new cudf::column abstraction
  • Two developers (TBD) will take responsibility for refactoring the features (one each) to use the newly designed abstraction(s) and submitting a cuDF PR for review. At least one of the developers shall be different from the developer who designed and implemented the column abstraction.
  • Any required design changes exposed in refactoring shall be discussed in the PR
  • Stakeholders will review refactored feature until consensus is reached
  • TBD will be responsible for creating/amending a style guide with lessons learned and best practices for refactoring a feature using gdf_column to the new abstraction(s)
  • Proposed Deadline: 0.9 Release

6. Full Refactor

  • Remaining libcudf features will be refactored one at a time to use the new column abstraction(s)
  • The style guide mentioned above will be distributed to all libcudf developers to provide guidance in this refactoring effort
  • This will be an ongoing process that likely will not be fully complete for several releases
@jrhemstad jrhemstad added feature request New feature or request help wanted proposal Change current process or code libcudf Affects libcudf (C++/CUDA) code. labels Apr 17, 2019
@jrhemstad
Contributor Author

jrhemstad commented Apr 17, 2019

My pain points:

  1. Users are burdened with explicitly allocating/freeing device memory for columns. Ownership of the memory is ill-defined and inconsistent.
  2. Basic interfacing with a column requires touching raw pointers, i.e., being required to do things like static_cast<T*>(col->data), if(nullptr == col->valid), gdf_is_valid(col->valid, i), etc.
  3. Allocating and interacting with the null bitmask requires explicit action by users such as calling gdf_valid_allocation_size and gdf_num_bitmask_elements.
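
To make pain points 2 and 3 concrete, here is a minimal host-side sketch of the idioms callers must hand-write today. The struct and helpers are simplified stand-ins for the real gdf_column declarations (the real gdf_valid_allocation_size also pads its result), not libcudf code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified stand-in for the gdf_column struct described above.
struct gdf_column_like {
    void*    data;   // type-erased (device) pointer in real libcudf
    uint8_t* valid;  // packed null bitmask; may be nullptr
    size_t   size;   // number of elements
};

// The kind of bit arithmetic callers must invoke by hand (cf. gdf_is_valid):
// one validity bit per row, nullptr mask means "all valid".
inline bool is_valid(const uint8_t* valid, size_t i) {
    return valid == nullptr || ((valid[i / 8] >> (i % 8)) & 1);
}

// The kind of explicit sizing helper (cf. gdf_valid_allocation_size, here
// without padding) users must remember to call before allocating a bitmask.
inline size_t valid_allocation_size(size_t num_rows) {
    return (num_rows + 7) / 8;  // bytes needed to hold num_rows bits
}
```

An encapsulated cudf::column would hide both helpers behind member functions, so forgetting them stops being possible.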

@kkraus14
Collaborator

Pain Points

  1. Numba is currently used all over for memory allocation / deallocation. We would much rather have that owned by libcudf while maintaining interoperability with Numba Device Arrays.
  2. Python column classes are completely disjointed from libcudf. Should ideally reference a cudf::column object to be used via Cython for better cudf<->libcudf interop.
  3. Type specific implementations currently exist in Python for types like Strings and Categoricals, but ideally should be in cudf::column for better maintainability and cross-language functionality.

@eyalroz
Collaborator

eyalroz commented Apr 17, 2019

Pain points

  1. Baked-in meta-data/statistics: A null count might be a useful thing, but - so are the min and max values, or the number of distinct values. These should be in a separate data structure, or some mix-in that a subclass of the column class could adopt.
  2. No single-bit-width columns. Factor-8 waste of space? That's outrageous; egregious; preposterous.
  3. Huge amount of idiosyncratic code for the validity indicators. We could just use the fact that it's another (non-nullable) column and invoke existing code (which should support 1-bit-wide columns).
  4. No compositionality. For example, a column of strings (or variable-length data) could be represented by a column of lengths, or by a column of offsets into a column of single characters (not the only possible representations); a nullable column - discussed in the previous item (and again, that's not the only reasonable representation) <- Addressing this fully would be a gargantuan task. But ignoring it completely is not a good idea either.
  5. A supposed "column" today can be in an invalid state, so you have to check columns whenever you get them in.
  6. Not using modern C++ vocabulary types like optional and variant to express aspects of columns' semantics. It's the idiomatic thing to do.
  7. Difficult to switch from working with type-erased to type-imbued columns and vice versa.
  8. Mutability - it isn't well defined, nor enforced, who and when, if at all, should be able to change a column (the host-side members and the on-device data) once it's been constructed.
  9. Column names. This may be a nice feature for debugging, but in reality - columns would be identified by some sort of taxonomy, e.g. schema+table+intra-table name - not by a single name. And it's not very useful to just keep the third name. We certainly don't want people to start encoding complex structures into the name. And in our code - we don't know what names to give output columns anyway. So let's drop the names
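
The compositional string representation from item 4 can be sketched host-side in a few lines. Everything here (the struct name, the choice of offsets over lengths) is illustrative, not a proposed libcudf API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One possible compositional string column: a child column of concatenated
// character data plus a child column of offsets. String i occupies the
// half-open range [offsets[i], offsets[i+1]) of 'chars'.
struct string_column_sketch {
    std::vector<char>    chars;
    std::vector<int32_t> offsets;  // always size()+1 entries

    std::string element(size_t i) const {
        return std::string(chars.begin() + offsets[i],
                           chars.begin() + offsets[i + 1]);
    }
};

string_column_sketch make_strings(const std::vector<std::string>& in) {
    string_column_sketch col;
    col.offsets.push_back(0);
    for (const auto& s : in) {
        col.chars.insert(col.chars.end(), s.begin(), s.end());
        col.offsets.push_back(static_cast<int32_t>(col.chars.size()));
    }
    return col;
}
```

The point is that both children are themselves plain columns, so existing column machinery could operate on them.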

@harrism
Member

harrism commented Apr 17, 2019

Pain Points

Mostly reiterating what I think are the most important things to fix first. Basically, the problem is that gdf_column lacks encapsulation.

  1. Explicit, manual, and low-level allocation and deallocation of column memory. Numba was used for this on the Python side, which required RMM to provide a monkey-patched numba workalike API for Python-side numba.cuda.device_array allocation.
  2. Returning new columns from cudf algorithms is risky because most routines still take pre-allocated column memory as an argument. Users may think that because it's returned they don't have to free the memory, which would lead to leaks.
  3. Managing valid masks is error prone and explicit.
  4. Because it's not a proper class, the API for columns is via free functions, which mostly have silly names and inconsistent interfaces.

@revans2
Contributor

revans2 commented Apr 17, 2019

  1. It is not clear who owns a block of memory and who does not.
    • It is nice to have the ability to ignore the functions provided and take ownership of the memory to do things like coalescing allocations and host-to-device transfers.
  2. Documentation. I am okay with having the guts of the column on display. The issue is that I have to reach into the guts to move data back and forth between the host and device, and it is not fully explained anywhere what those guts really are, or what contract I have to follow. This is mostly around strings and string categories, which you have to special-case. But it becomes more critical as we add in things outside the scope of this redesign.
  3. Sorry I really only have 2 for now.

@harrism
Member

harrism commented Apr 17, 2019

I have an idea regarding step 5. Revise to:

5. Initial Refactor

  1. Two candidate libcudf features shall be chosen for refactoring to use the new cudf::column abstraction
  2. Two developers (TBD) will take responsibility for refactoring the features (one each) to use the newly designed abstraction(s) and submitting a cuDF PR for review. At least one of the developers shall be different from the developer who designed and implemented the column abstraction.
  3. Any required design changes exposed in refactoring shall be discussed in the PR...

I think this will help flush out more issues in the abstraction. We could keep it at one candidate feature, done by a different engineer than the column implementer, but two would be nice.

@harrism
Member

harrism commented Apr 17, 2019

I also would like to tighten up the schedule a bit. I would like to get the basic abstraction designed and implemented by 0.8 release. If we can't do that, then I think the scope is too broad and should be narrowed. Since most of our pain points agree, I don't think the scope is too large at this point.

@jrhemstad
Contributor Author

Updated with @harrism's feedback about schedule and the initial refactor process.

@jlowe
Member

jlowe commented Apr 17, 2019

Pain Points

  1. Device memory management is explicit and error-prone, and I would like to see automated output allocation. However there are some use-cases with pre-allocated output targets, and it would be nice to not preclude those.
  2. Related to the previous, the management of CPU-side objects associated with gdf_column can be messy. Who is responsible for freeing the column name? Should it be possible to create a string category column without a corresponding NVCategory and if/when that should be freed when the column is freed?
  3. Validity masks are exposed more than I think they need to be.
  4. The lack of a C++ wrapper that bundles the relevant methods and simplifies proper handling of resource allocations within code that can throw.

@eyalroz
Collaborator

eyalroz commented Apr 18, 2019

@jlowe and everyone else: About column names - do we actually use column names at all in libcudf? I'm not asking about Python code, just the C/C++.

@kovaltan
Contributor

My pain points:

  1. bitmap iterator and/or class (validity bitmask handling nulls)
    The lack of a bitmap iterator makes the implementation messy when the API handles nulls.
    It would be easy to handle nulls if we used a const bitmap iterator together with a transform iterator.
    A validity bitmask class would also be useful for managing nulls, including allocation/deallocation helpers

  2. input column check at each function
    gdf functions usually have sanity checks for input/output columns; it would be better if cudf::column had a sanity-check API:
    GDF_REQUIRE(nullptr != col_in, GDF_DATASET_EMPTY);
    GDF_REQUIRE(nullptr == col_in->valid || 0 == col_in->null_count, GDF_VALIDITY_UNSUPPORTED);
    or
    CUDF_EXPECTS(col != nullptr, "Input column is null");

  3. Returning new columns from cudf algorithms is risky
    Same as @harrism
    The allocation of the columns should be outside of the algorithms.

  4. cudf::test::column_wrapper<cudf::wrapper types> does not accept an integer initializer like {0,1,2,3}
    This is somewhat off-topic, but it would be better if we had it.
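
The bitmap-iterator idea in point 1 can be illustrated host-side: wrap the packed mask so null handling becomes an ordinary iterator dereference rather than hand-written bit arithmetic. This is a toy sketch (in real libcudf code one would presumably build on thrust iterators), and all names are made up:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Read-only iterator over a packed validity bitmask: *it yields true for
// valid rows, false for nulls.
class bit_iterator {
    const uint8_t* mask_;
    size_t pos_;
public:
    bit_iterator(const uint8_t* mask, size_t pos) : mask_(mask), pos_(pos) {}
    bool operator*() const { return (mask_[pos_ / 8] >> (pos_ % 8)) & 1; }
    bit_iterator& operator++() { ++pos_; return *this; }
    bool operator!=(const bit_iterator& o) const { return pos_ != o.pos_; }
};

// Null-aware sum: invalid rows contribute the identity, exactly the pattern
// a transform iterator over (data, validity) pairs would give for free.
template <typename T>
T null_aware_sum(const T* data, const uint8_t* mask, size_t n) {
    T sum{};
    bit_iterator it(mask, 0);
    for (size_t i = 0; i < n; ++i, ++it)
        if (*it) sum += data[i];
    return sum;
}
```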

@eyalroz
Collaborator

eyalroz commented Apr 18, 2019

@kovaltan :

  • You suggest that the allocation of columns should occur outside of "the algorithms" (by which I take it you mean the implementation of cudf API functions). But is that a pain point about the column abstraction we use? And - do you think this should be resolved through the design of a column class?
  • About the input checks: Note that, for now, we have PR [REVIEW] Adding some trivial gdf_column-related utility functions #1390, and specifically validate(my_gdf_column) which performs all of these checks.

@kovaltan
Contributor

@eyalroz ,

  • You suggest that the allocation of columns should occur outside of "the algorithms" (by which I take it you mean the implementation of cudf API functions). But is that a pain point about the column abstraction we use? And - do you think this should be resolved through the design of a column class?

By "the algorithms" I mean things like cudf::scan(), not the cudf::column class.
For now, we pass a pre-allocated output column to cudf::scan().
I think it should be pre-allocated, not allocated inside cudf::scan().
The cudf::column class can allocate device memory, and may have a device_memory_owner flag.
cudf::column should free the device memory if device_memory_owner = true.

That would be really nice!

@jlowe
Member

jlowe commented Apr 18, 2019

@eyalroz

About column names - do we actually use column names at all in libcudf?

The column names are filled in by the csv and parquet parsers. There needs to be some way to convey the names discovered during parsing back to the caller. However I could see some debate whether it really needs to be stored directly in the column structure.

@revans2
Contributor

revans2 commented Apr 18, 2019

@eyalroz and @kovaltan
I do agree that a column should have some knowledge about ownership of the memory it references.
I also realize that

Support delayed materialization or lazy evaluation

is not a part of the design goals here. But I think it will become an important feature likely combined with JITing so we should at least think ahead to how this might work and not paint ourselves into a corner. One of the key pieces of doing the lazy evaluation is that it can eliminate intermediate results and as such will not need to allocate the corresponding memory.

So if we want to be able to eliminate memory allocation we cannot pre-allocate it for every operation. I would propose, instead, that the operation itself do the device memory allocation. Not allocate the column class, but allocate the device memory the column ultimately points to. We could structure the column class so that it knows how to "allocate" and "free" memory which can be overridden by the user creating it. That should give us the flexibility to do just about anything we want with memory management of the columns, but also lets the algorithms themselves decide when to allocate memory, and even how much memory to allocate.

@jrhemstad
Contributor Author

jrhemstad commented Apr 18, 2019

Returning new columns from cudf algorithms is risky
Same as @harrism
The allocation of the columns should be out side of the algorithms.

@kovaltan I think you may have misunderstood Mark's comment here:

Returning new columns from cudf algorithms is risky because most routines still take pre-allocated column memory as an argument. Users may think that because it's returned they don't have to free the memory, which would lead to leaks.

I believe he was referring to how presently returning a gdf_column from a function is risky because we don't have any gdf_column destructor, so the burden falls on the caller to free the column's underlying memory.

For example:

// This is leaky because it requires the caller to remember to free the column's device memory
gdf_column output = gdf_algorithm(input...);

However, we have identified that it is desirable to be able to return outputs from functions rather than use output parameters (for various reasons: clearer style, delayed JIT evaluation, etc.). Therefore, we want to make that safer via ownership semantics.

As such, we likely will be moving in the direction that algorithms allocate the device memory for outputs.

To be clear:

// This is safe because lifetime management of the device memory is automated
cudf::column output = cudf::algorithm( input... );
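
The ownership semantics being described can be sketched with host memory standing in for device memory (names like owning_column and some_algorithm are illustrative, not the proposed cudf API). The essential point is that the returned object's destructor releases the allocation, so the caller cannot leak it:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

struct free_deleter { void operator()(void* p) const { std::free(p); } };

// RAII column sketch: owns its buffer, move-only, frees on destruction.
class owning_column {
    std::unique_ptr<void, free_deleter> data_;
    size_t size_;
public:
    explicit owning_column(size_t bytes)
        : data_(std::malloc(bytes)), size_(bytes) {}
    void*  data() const { return data_.get(); }
    size_t size() const { return size_; }
};

// Hypothetical algorithm: allocates its own output and returns it by value.
owning_column some_algorithm(size_t n) {
    owning_column out(n * sizeof(int));
    return out;  // moved/elided out; buffer freed whenever the result dies
}
```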

@eyalroz
Collaborator

eyalroz commented Apr 18, 2019

@revans2 :

I ... realize that .... delayed materialization or lazy evaluation ... is not a part of the design goals here. But I think it will become an important feature ... so we should at least think ahead to how this might work and not paint ourselves into a corner.

You are preaching to the choir - I definitely agree (in fact I would love to work on faster JITing using PTX, regardless of cudf). This is part of why I've proposed removing some of the features of columns for the basic/fundamental abstraction - flexibility for later use.

One of the key pieces of doing the lazy evaluation is that it can eliminate intermediate results and as such will not need to allocate the corresponding memory. So if we want to be able to eliminate memory allocation we cannot pre-allocate it for every operation.

But why are we certain that the allocating code will be part of what gets JITed together? Clearly, to construct operators by JITting we would use composable building-blocks which are not the functions in cudf/functions.h. So I would not be worried about this point.

I would propose, instead, that the operation itself do the device memory allocation. Not allocate the column class, but allocate the device memory the column ultimately points to. We could structure the column class so that it knows how to "allocate" and "free" memory which can be overridden by the user creating it.

I'm thinking about a design which is reminiscent of std::unique_ptr - where a choice of a different template parameter allows for a different de/allocation mechanism. If I followed you correctly, I think that should allow for the behavior you described.
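
To illustrate that std::unique_ptr-like direction (purely a host-side sketch under my own naming, not a proposal text): the de/allocation mechanism becomes a template parameter, so a caller could swap in, say, an RMM-backed allocator without changing the column type's interface:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Column sketch parameterized on its allocator, analogous to how
// std::unique_ptr is parameterized on its deleter.
template <typename Allocator = std::allocator<std::byte>>
class basic_column {
    Allocator  alloc_;
    std::byte* data_  = nullptr;
    size_t     bytes_ = 0;
public:
    explicit basic_column(size_t bytes, Allocator a = {})
        : alloc_(a), data_(alloc_.allocate(bytes)), bytes_(bytes) {}
    ~basic_column() { if (data_) alloc_.deallocate(data_, bytes_); }
    basic_column(const basic_column&)            = delete;  // owning, non-copyable
    basic_column& operator=(const basic_column&) = delete;
    size_t size_bytes() const { return bytes_; }
};
```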

@eyalroz
Collaborator

eyalroz commented Apr 18, 2019

@jlowe :

The column names are filled in by the csv and parquet parsers. There needs to be some way to convey the names discovered during parsing back to the caller.

Certainly, but the parsers are not part of libcudf; so why should this extra information be passed to libcudf if it doesn't get used there at all?

@harrism
Member

harrism commented Apr 19, 2019

Certainly, but the parsers are not part of libcudf; so why should this extra information be passed to libcudf if it doesn't get used there at all?

Since when? The CSV, Parquet, and soon ORC and JSON-lines parsers are all part of libcudf.

@eyalroz
Collaborator

eyalroz commented Apr 19, 2019

@harrism : Yes, right, I misspoke. That is part of the libcudf codebase. But the library itself, or rather the API functions which process columns, don't take CSVs and parse them, nor do they make calls to CSV-parsing code. And the names aren't used (it seems).

So, what if the names were to simply be parsed into a separate array in the csv_read_args structure?

@kovaltan
Contributor

@jrhemstad

Ah, I had misunderstood Mark's comment.
I have only one concern about allocating device memory inside cudf::algorithm().
If allocation happens in cudf::algorithm(), it's hard for users to manage device memory allocation themselves,
though I'm not sure users want that, and there could be some way to pass a user-defined allocator to cudf::algorithm().

// This is safe because lifetime management of the device memory is automated
cudf::column* output = cudf::algorithm( input... );

Pros:

  • easy to use: users can use the output column without allocating it themselves
  • safe: deleting the cudf::column can free the underlying device memory
  • input/output dtype/size/nulls validation is not needed

Cons:

  • it's hard for users to manage device memory allocation by themselves

@jrhemstad
Contributor Author

I have only one concern about allocating device memory inside of cudf::algorithm().
If allocating is in cudf::algorithm(), it's hard for users to manage device memory allocation by themselves, though I'm not sure if the users want, and there would be some way to pass user defined allocator to cudf::algorithm().

You've alluded to the solution. It's orthogonal to the column design, but ultimately we would like it such that you can pass a custom allocator into any cudf algorithm, similar to how you can with Thrust.
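
As a toy host-side illustration of that allocator-parameter idea (all names here - counting_allocator, double_all - are made up for the example, and host malloc stands in for device allocation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Caller-supplied allocator; this one also counts bytes so the caller can
// observe where the algorithm's output memory comes from.
struct counting_allocator {
    size_t* bytes_allocated;  // caller-owned counter
    void* allocate(size_t n) { *bytes_allocated += n; return std::malloc(n); }
    void  deallocate(void* p, size_t) { std::free(p); }
};

// Hypothetical algorithm: doubles each element, obtaining scratch/output
// memory through the allocator passed in, Thrust-style.
template <typename Alloc>
std::vector<int> double_all(const std::vector<int>& in, Alloc& alloc) {
    int* out = static_cast<int*>(alloc.allocate(in.size() * sizeof(int)));
    for (size_t i = 0; i < in.size(); ++i) out[i] = in[i] * 2;
    std::vector<int> result(out, out + in.size());
    alloc.deallocate(out, in.size() * sizeof(int));
    return result;
}
```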

@OlivierNV
Contributor

Regarding column names: I think they should eventually not be part of the column itself but part of a parent array of columns that would include the schema, containing the column names and their relationships (this will also allow preserving schema information when reading/writing columnar formats).

@nsakharnykh
Collaborator

Pain points

  1. Touching raw pointers and underlying data structures of the column
    a. Difficult to track down out-of-bounds and null access errors
    b. Difficult to evaluate other internal column formats (e.g. a list of chunks instead of one flat array)
    c. Difficult to update the validity mask from a CUDA kernel
  2. Memory management
    a. Who owns the memory allocated for the column, when it should be freed
    b. What happens when someone is trying to copy data (shallow vs deep copy semantics)
    c. What are the mechanisms to realloc the column (resize/expand)

@eyalroz
Collaborator

eyalroz commented Apr 25, 2019

it should provide common functionality to those systems. Cudf already does some of this with partition, which really has no value outside of a system like these.

The thing is, libcudf doesn't provide all possible common functionality, and AFAIK it's not supposed to. Granted, though, the exact scope of libcudf doesn't seem to be put in writing well enough (certainly not in the repository's README.md...)

Also, again, can you be specific about concrete optimization opportunities which will be missed due to columns being immutable after construction?

@thomcom
Contributor

thomcom commented May 2, 2019

Pain points

  1. gdf_columns are specifically that - columns. This makes indexing by rows, a common operation, expensive and meticulous at the python level of cudf. Transpose is now supported, but is an extremely expensive operation. Dozens of Pandas API calls support an axis argument, trivially alternating between the index or columns of a Dataframe or Series. It would be nice if the new take on gdf_column wasn't, literally, columns, but supported the ability to "swap axis" in an inexpensive single operation.
  2. Names - Pandas MultiIndex allows for multiple columns with the same name - something that can't be achieved easily in our current setup. We need an abstraction that is more complex than 1:1 between columns-as-allocated and the set of column-identifiers. A column identifier that is simultaneously a name and an index might cover all necessary use cases.
  3. Width x Height concerns - This is maybe a long shot, but CUDF is severely limited by dataframe orientation. Wide matrices quickly lose their edge over CPU based computing. Relating to Collaborations on columnar data structures  #1 - columns should have arbitrary axes, and the most efficient implementation should be chosen either by the developer or by the algorithm.
  4. There's a lot of complaints about memory management, for good reason. I work 95+% of the time in python so 😎 but memory pools, reference counting and garbage collection, anyone? Objective-C makes a great example of how to do this well. Our gains are found during the GPU accelerated convolution of many frame values - I don't think that performance loss from GC would be noticeable.

@eyalroz
Collaborator

eyalroz commented May 2, 2019

@thomcom : Following your listing of points:

  1. Can you explain what you mean by "indexing by rows"? Do you mean accessing tuples - individual rows of a table - rather than dealing with disparate columns?
    Also, as for transposing - that would mean replacing, say, a single column with 1 million elements with 1 million columns of a single element each. Certainly that's not supported... at least not in libcudf :-(
  2. I think you're talking about the Python side here, because in libcudf we don't access columns by name. I also think you can even read from a CSV with multiple columns having the same name (not sure though).
  3. This relates to what you said about transposition. Essentially, libcudf handles columns, not two- or multi-dimensional data; and this is central to the design, beyond the column abstraction even. A library for multi-dimensional data is a different kind of beast than a library for manipulating columns; so you're suggesting a change in the library's entire scope. I am not for or against such a change, but here we're just considering a column abstraction for libcudf with its current scope. However, you might consider the encoding of a matrix in a three-column table, with columns for "matrix_row", "matrix_column" and "value".
  4. Modern C++ avoids the need for garbage collection, using smart pointers and other RAII classes, which are freed exactly when they stop being used. This is (mostly) encapsulated in the Resource Management section of the C++ Core Guidelines. You could argue that this is insufficient/inappropriate for our case and a garbage collector is required - but you would need to make a convincing argument, since a garbage collector is a large amount of very serious code, with all sorts of overhead (for developers and at runtime).
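
The three-column matrix encoding from point 3, in miniature (a COO-style struct-of-columns; everything here is illustrative, not a libcudf interface). Note that under this encoding a "transpose" is just swapping which child column plays the role of rows:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// A matrix encoded as a table of three columns: (matrix_row, matrix_column, value).
struct matrix_as_table {
    std::vector<int>    row;
    std::vector<int>    col;
    std::vector<double> value;

    void insert(int r, int c, double v) {
        row.push_back(r);
        col.push_back(c);
        value.push_back(v);
    }

    // Transpose without touching any values: swap the row and column columns.
    void transpose() { std::swap(row, col); }
};
```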

@harrism
Member

harrism commented May 3, 2019

Actually @eyalroz we DO have transpose in libcudf but it's slow for large input columns, for obvious reasons. In general, @thomcom I think we would love to be able to provide efficient multi-axis indexing, but if you think about it, this is a very hard problem. It's easy in, say tensor libraries because everything in the tensor has the same datatype. But in a typed language, how do you efficiently transpose a table of columns with different types so you can access the row memory efficiently? You end up with columns that have to support multiple data types. Not pretty however you take it on...

@eyalroz
Collaborator

eyalroz commented May 3, 2019

Actually @eyalroz we DO have transpose in libcudf but it's slow for large input columns, for obvious reasons.

Hmm, cognitive dissonance on my part there. Couldn't bring myself to remember something which didn't make sense... #1603 .

@thomcom
Contributor

thomcom commented May 3, 2019

Thanks for your response @eyalroz.

  1. From the perspective of your average ML developer, accessing a tuple - a row of a table, doesn't have a real distinction from accessing a column. A.T.T = A after all. From a math perspective the orientation only matters to determine order of operations. At the user level it would be nice to have the ability to treat rows and columns interchangeably, including, at least, fast transpose. I partially understand your perspective coming from a database design standpoint.

  2. Ok - at the python level column naming has been causing me some real headaches - sounds like it is my problem (literally).

  3. This is specifically why I wanted to bring it up and at least get it thought about here. Following up on @harrism as well - it is possible to encode rows and columns together in contiguous data and index into it in O(1). I agree it could get complicated, and obviously single-type columns are much more efficient than columns with mixed type. I want you guys thinking about it and knowing that Pandas people are asking for it, because transpose and row/column major conversions are used very frequently and they're slow like a barge in python, unless the library is written to just swap labels in python-native data (like Pandas). Then it's just slow like python.

I'm sorry if I'm way out in left field here. I have a fantasy that with careful design we could, for example, transpose 10GB of columns in a few operations, simply by shuffling labels around. This would be powerful.

  1. The most common pain point has been memory management. I'm not terribly up-to-date in C++11 or 14, but I don't get the impression that "Do RAII better" is the answer for our many developers. I suggested reference counting and memory pools because it sounded like the best of both worlds in this context. Developers can simply pull new objects cudf::column cool_column = coolalg(args...) and not worry about managing them, or they can allocate a memory pool and do whatever manual allocations they want, trusting that when the memory pool has been destructed then they have cleaned up after themselves.

@thomcom
Contributor

thomcom commented May 3, 2019

Regarding all my "let's support matrix-styled data" - if this is gdf_column and it is going to stay column-only, I understand. :)

@jrhemstad
Contributor Author

jrhemstad commented May 22, 2019

@revans2 @jlowe @felipeblazing

@harrism and I are discussing whether or not an allocator should be exposed to libcudf users to control how columns are allocated. I.e., should the libcudf functions accept an allocator parameter to control how device memory is allocated for output cudf::columns that are returned from libcudf functions.

If RMM w/ pool mode is "fast enough" such that memory allocations are no longer a bottleneck, then is there another reason you'd need control over how the memory is allocated? Would it be a showstopper to not provide control over how the device memory is allocated? (Just making sure we're covering our bases and understand all the use cases).

@eyalroz
Collaborator

eyalroz commented May 22, 2019

@jrhemstad : What about @felipeblazing 's allocator use scenarios mentioned in this comment?

@jlowe
Member

jlowe commented May 22, 2019

The two cases I can think of where allocators could be useful:

  1. Managing one's own temp space as @felipeblazing called out here. I've heard RMM needs to synchronize on a stream for some allocation/free scenarios even in pool mode. For that use case, would RMM's pool mode and a sequence of allocations and frees of the same size be essentially just as fast as the manually managed temp space that was allocated once?

  2. Building a large column by concatenating a series of outputs as they are generated. For example, there are 5 columns that each need some operation performed and the outputs concatenated into one big column at the end. Instead of paying for 2x the output space and the full copy via gdf_column_concat, the code could allocate the large output column and execute the 5 operations into the output column's space at the appropriate offset for each. Yes, there are issues with this when validity vectors are involved and the column sizes aren't a multiple of 8. It also violates the principle of column immutability as the app code mutates the final output column over the 5 operations, so I could see how one could argue this isn't a valid use-case.
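
Use case 2 can be sketched with host memory (fused_concat and negate_into are made-up names; the validity-mask alignment caveat above is deliberately ignored): pre-allocate one large output buffer and have each operation write at its own offset, avoiding the 2x space and copy of a separate concat pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-chunk operation: negate n values from in into out.
void negate_into(const int* in, int* out, size_t n) {
    for (size_t i = 0; i < n; ++i) out[i] = -in[i];
}

std::vector<int> fused_concat(const std::vector<std::vector<int>>& chunks) {
    size_t total = 0;
    for (const auto& c : chunks) total += c.size();
    std::vector<int> out(total);  // single large allocation up front
    size_t offset = 0;
    for (const auto& c : chunks) {
        // Each operation writes directly into the big buffer at its offset.
        negate_into(c.data(), out.data() + offset, c.size());
        offset += c.size();
    }
    return out;  // already "concatenated": no second copy needed
}
```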

@harrism
Member

harrism commented Jun 21, 2019

The design is in progress, but moving this to 0.9 to keep it open.
