xref https://github.com/pandas-dev/pandas/issues/17144
I am trying to get `concat` (and affiliated functionality) working with the external GeometryBlock (which we are developing in geopandas). This is currently failing, and in such a manner (I think) that it will not be fixable in geopandas.
Currently the `concat` (and the `concatenate_join_units`) functionality has a lot of hardcoded logic about blocks. Two problems that I encountered (so far):
- It does not respect the 'dimensionality' of the block.
  - In case of non-consolidatable blocks, the block values are stored in 1D array-likes, while the block is actually 2D. For this, there is some hard-coded logic: in `_concat_compat` (https://github.com/pandas-dev/pandas/blob/34c4ffd7c454848311f71e6869b5cad4bc132449/pandas/core/dtypes/concat.py#L100, called from `concatenate_join_units`), e.g. for tz datetimes it defers to `_concat_datetimetz`, which knows it gets 1D instead of 2D data. The general case, however, is passed to `np.concatenate(to_concat, axis=axis)`, where the `axis` does not take the dimensionality of the Block into account (which leads to an error if your block is non-consolidatable with 1D underlying data).
  - A possible solution is to check the dimensionality of the data of the block and, if it is 1D, take this into account when concatting the data.
  - However, the problem is then still that `np.concatenate` does not fully do the right thing. So a further possible solution would be to have e.g. a class method on the block to concatenate the blocks.
- In general, Block types are not preserved.
  - E.g. in the case of concatting two Series objects of the same block type, the values are concatted and the block type is re-inferred. This means that the result can never have the externally defined block type without pandas knowing about this block type. (This does not only happen in concat, but in other places as well.)
  - In the specific case of concat, I think we could certainly check if the block types are equal and in that case preserve the block class.
  - More in general, it is difficult to manipulate BlockManagers while preserving block types. E.g. `BlockManager.insert/set` are not able to add a Block while preserving the block class (they always infer the block type based on the data). Another example is `concatenate_block_managers`, where the blocks are again reconstructed without taking into account the original block class.
  - Would it be desirable/acceptable to retain the block class more? (Not sure if this would have an impact on pandas itself.)
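To make the two proposals above concrete, here is a minimal sketch using toy classes, not the real pandas internals. `ToyBlock`, `ToyGeometryBlock`, `concat_same_type` and `concat_blocks` are hypothetical names invented for this illustration. The only real library call is `np.concatenate`. The sketch assumes that a non-consolidatable block holds 1D values while claiming `ndim=2`.

```python
import numpy as np


class ToyBlock:
    """Hypothetical stand-in for a pandas Block (not the real class)."""

    def __init__(self, values, ndim=2):
        self.values = np.asarray(values)
        self.ndim = ndim  # dimensionality the block claims to have

    @classmethod
    def concat_same_type(cls, blocks, axis):
        # Hypothetical class method: the block decides how to concatenate.
        values = [b.values for b in blocks]
        if all(v.ndim == 1 for v in values):
            # 1D underlying data: the 2D block axis does not apply,
            # so concatenate along the only axis there is.
            axis = 0
        # Building the result with `cls` preserves the block class.
        return cls(np.concatenate(values, axis=axis), ndim=blocks[0].ndim)


class ToyGeometryBlock(ToyBlock):
    """Hypothetical external block storing 1D data while being 2D."""


def concat_blocks(blocks, axis):
    # If all blocks share one class, defer to that class and keep it.
    kinds = {type(b) for b in blocks}
    if len(kinds) == 1:
        return kinds.pop().concat_same_type(blocks, axis)
    # Mixed classes: fall back to the generic block (real code would
    # have to re-infer the block type here; that is out of scope).
    return ToyBlock.concat_same_type(blocks, axis)


a = ToyGeometryBlock(np.arange(3))
b = ToyGeometryBlock(np.arange(3, 5))

# Plain np.concatenate with the 2D block axis fails on the 1D data ...
try:
    np.concatenate([a.values, b.values], axis=1)
    raw_concat_failed = False
except ValueError:  # numpy's AxisError is a ValueError subclass
    raw_concat_failed = True

# ... while the class method handles the axis and keeps the block class.
out = concat_blocks([a, b], axis=1)
```

With something along these lines, `out` would stay a `ToyGeometryBlock` without the generic concat code needing to know anything about that class.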
@jreback (cc @mrocklin)
Comment From: jorisvandenbossche
@jreback Do you have an opinion about this? We mentioned it very briefly at the end of the dev call, but I no longer remember what you said.
Comment From: mrocklin
Thought I'd ping this in case this got lost in @jreback's inbox (which I'm sure is overflowing).
This poses a significant usability issue for the geopandas cythonization process.
Comment From: jorisvandenbossche
@mrocklin FYI, I just started a PR looking at the Series case: https://github.com/pandas-dev/pandas/pull/17728
Comment From: mrocklin
Oh excellent
Comment From: jorisvandenbossche
I think this can be closed now (given the merged PRs and the ExtensionArray interface functionality). Any remaining problems with concatting would be bugs in the ExtensionArray interface.