Research

  • [X] I have searched the [pandas] tag on StackOverflow for similar questions.

  • [X] I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/72726742/how-to-write-parquet-files-with-pandas-that-can-be-read-in-aws-athena

Question about pandas

Doing this:

from io import BytesIO

out_buffer = BytesIO()
input_dataframe.to_parquet(out_buffer, index=False, compression="gzip")

Results in this:

file schema:               schema
--------------------------------------------------------------------------------
somID:                     OPTIONAL INT64 R:0 D:1
SessionID:                 OPTIONAL INT64 R:0 D:1
JobID:                     OPTIONAL INT64 R:0 D:1
JobCreationTime:           OPTIONAL BINARY L:STRING R:0 D:1
ProcessedId:               OPTIONAL BINARY L:STRING R:0 D:1
S3Results:                 OPTIONAL BINARY L:STRING R:0 D:1

Which leads to this error in Athena:

HIVE_METASTORE_ERROR: com.amazonaws.services.datacatalog.model.InvalidInputException: Error: 
type expected at the position 0 of 'integer' but 'integer' is found. 
(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)

Is there a way to not have optional fields?

Comment From: datapythonista

This depends on a few things. Are you using pyarrow or fastparquet? And what are the types of your columns? For the int64 columns, if you're using numpy's int64 data type, and not pandas' nullable Int64, then I don't think the resulting column in parquet should be optional.

Maybe you can check this dataframe:

import pandas

pandas.DataFrame({'not_nullable': [1, 2], 'nullable': pandas.Series([3, pandas.NA], dtype='Int64')})

and if the not_nullable column is optional when using to_parquet, open an issue reporting which parquet backend you're using (and its version). The problem may be in pyarrow (or fastparquet), since I guess we're using their Schema.from_pandas, but we'll have a look first and go from there.

Does this make sense?

Comment From: WillAyd

I think @datapythonista is correct that this issue might lie with the parquet engines (at least pyarrow).

>>> import pandas
>>> import pyarrow as pa
>>> df = pandas.DataFrame({'not_nullable': [1, 2], 'nullable': pandas.Series([3, pandas.NA], dtype='Int64')})
>>> pa.Table.from_pandas(df).schema
not_nullable: int64
nullable: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 492

I could see the argument that, ideally, this would show int64 not null for the not_nullable column.

@jorisvandenbossche

Comment From: WillAyd

As a workaround, you should be able to specify the schema explicitly when constructing the Table via pyarrow:

https://github.com/apache/arrow/pull/4397/files

Comment From: phofl

Closing for now; please ping to reopen when you can address the questions above.