Pandas version checks

  • [X] I have checked that this issue has not already been reported.
    • No issues mention QUOTE_NOTNULL
  • [X] I have confirmed this bug exists on the latest version of pandas.
  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import csv
import pandas as pd

df = pd.DataFrame(
    [[0, None, np.nan, pd.NA]],
    columns=['a', 'b', 'c', 'd'])

print(df.to_csv(quoting=csv.QUOTE_NOTNULL, index=False))

Issue Description

All fields are quoted, even nulls.

"a","b","c","d"
"0","","",""

Expected Behavior

All nulls should be unquoted.

"a","b","c","d"
"0",,,

Another possibility I might have expected is only unquoting None, as described in the csv docs, but this doesn't make sense for Pandas IMHO since Pandas has multiple "NULL"s.

"a","b","c","d"
"0",,"",""

Installed Versions

INSTALLED VERSIONS
------------------
commit                : b1c2ba793bb1f3d7001de41ca4c73377006b9e4d
python                : 3.12.7
python-bits           : 64
OS                    : Linux
OS-release            : 5.4.0-200-generic
Version               : #220-Ubuntu SMP Fri Sep 27 13:19:16 UTC 2024
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : None
LANG                  : fr_CA.UTF-8
LOCALE                : fr_CA.UTF-8

pandas                : 3.0.0.dev0+1729.gb1c2ba793b
numpy                 : 2.1.0.dev0+git20240403.e59c074
dateutil              : 2.9.0.post0
pip                   : 24.2
Cython                : None
sphinx                : None
IPython               : 8.22.2
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
psycopg2              : None
pymysql               : None
pyarrow               : None
pyreadstat            : None
pytest                : None
python-calamine       : None
pytz                  : 2024.1
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None

Comment From: asishm

Thanks, looks like csv.QUOTE_NOTNULL was added in python 3.12 along with csv.QUOTE_STRINGS. The user guide at https://pandas.pydata.org/docs/user_guide/io.html#quoting-compression-and-file-format also only mentions 0,1,2,3

quotingint or csv.QUOTE_ instance, default 0 Control field quoting behavior per csv.QUOTE_ constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

Comment From: rhshadrach

Contributions implementing QUOTE_NOTNULL are welcome!

Comment From: asishm

@rhshadrach should a separate issue be opened for QUOTE_STRINGS ?

Comment From: rhshadrach

I think it's okay either way - fine with having both be this issue. It's okay for a PR to only implement one, it just won't fully close this issue.

Comment From: snitish

@wjandrea it seems QUOTE_NOTNULL (and other QUOTE_*) are already implicitly supported. They are passed through as-is to the csv module's writer. The problem in your case is that there is a na_rep argument in to_csv that defaults to '' (empty string), so the NAs are being replaced with empty strings and QUOTE_NOTNULL is not taking effect. I explicitly set na_rep to None -

>>> print(df.to_csv(quoting=csv.QUOTE_NOTNULL, index=False, na_rep=None))
"a","b","c","d"
"0",,,

and it worked as expected. In light of this, is there anything we really need to fix here? @rhshadrach @asishm

Comment From: rhshadrach

I explicitly set na_rep to None -

Is this tested? The docstring type-hints na_rep as a str, so if we are supporting None it should be updated. This seems quite useful, so I'm supportive of allowing None.

Comment From: wjandrea

@rhshadrach I think it would make more sense to have to_csv() ignore the na_rep parameter if quoting == csv.QUOTE_NOTNULL and instead use None internally, something like this:

def to_csv(
    ...,
    na_rep: str = ''
):
    ...
    _na_rep: str | None = na_rep
    if quoting == csv.QUOTE_NOTNULL:
        _na_rep = None

    formatter = DataFrameFormatter(
        ...
        na_rep=_na_rep,
    )
    ...

Comment From: rhshadrach

In general, I'm negative on us ignoring values specified by users. Isn't that what your suggested approach is doing?

Comment From: wjandrea

@rhshadrach Yes it ignores the na_rep argument, but if the user specifies csv.QUOTE_NOTNULL, what else would they expect? I mean, the csv docs say:

csv.QUOTE_NOTNULL ... quote all fields which are not None. ... if a field value is None an empty (unquoted) string is written.

Comment From: snitish

I explicitly set na_rep to None -

Is this tested?

I did a quick grep and couldn't find any tests that set na_rep=None. Agree that supporting (officially) na_rep=None might be cleaner than ignoring na_rep when the user specifies csv.QUOTE_NOTNULL.

Comment From: rhshadrach

@rhshadrach Yes it ignores the na_rep argument, but if the user specifies csv.QUOTE_NOTNULL, what else would they expect? I mean, the csv docs say:

The pandas API often goes beyond the features of Python, and this can cause conflicts. In this particular case pandas allows specifying the NA representation while I believe Python does not (correct me if this is wrong). In such cases, I do not think it is always best to follow the Python API to the letter. At times, disagreeing with Python behavior is appropriate.

Some options:

  1. Ignore na_rep when csv.QUOTE_NOTNULL is specified.
  2. Raise when na_rep is not None when csv.QUOTE_NOTNULL is specified.
  3. Change the default of na_rep to lib.no_default, making the real default of na_rep dependent on other arguments (None when csv.QUOTE_NOTNULL is specified, "" otherwise).
  4. Keep the current default and behavior of na_rep="", making a user specify na_rep=None and csv.QUOTE_NOTNULL to get the same behavior as Python's CSV module.

cc @pandas-dev/pandas-core

Comment From: Dr-Irv

I'd vote for option 2:

Raise when na_rep is not None when csv.QUOTE_NOTNULL is specified.

Comment From: wjandrea

@rhshadrach Ah, I see what you mean.

How about a combination of options 2 and 3 to allow the user to specify None explicitly or implicitly? i.e. if the user specifies csv.QUOTE_NOTNULL, then for na_rep: - unspecified: use None - None: continue as normal - any other value: raise

In code:

def to_csv(
    ...,
    na_rep: str | None | lib.NoDefault = lib.no_default
):
    ...
    _na_rep: str | None
    if quoting == csv.QUOTE_NOTNULL:
        if na_rep is lib.no_default or na_rep is None:
            _na_rep = None
        else:
            raise ValueError('For "quoting=csv.QUOTE_NOTNULL", "na_rep" must be "None" or unspecified')
    else:
        _na_rep = '' if na_rep is lib.no_default else na_rep

    formatter = DataFrameFormatter(
        ...
        na_rep=_na_rep,
    )
    ...

Comment From: rhshadrach

How about a combination of options 2 and 3...:

  • ...
  • any other value: raise

It seems to me this would be raising unnecessarily. Why is that preferred over adhering to the value the user specified?

Comment From: wjandrea

@rhshadrach If the function adhered to the value the user specified, what would it do with it? As far as I can figure, if the user specifies csv.QUOTE_NOTNULL, the only reasonable value for na_rep is None.

Comment From: rhshadrach

Why isn't

df = pd.DataFrame([[0, None, np.nan, pd.NA]], columns=['a', 'b', 'c', 'd'])
print(df.to_csv(na_rep="x", index=False, quoting=csv.QUOTE_ALL))
# "a","b","c","d"
# "0",x,x,x

reasonable?

Comment From: wjandrea

@rhshadrach That's reasonable behaviour, but it goes beyond csv.QUOTE_NOTNULL ("an empty (unquoted) string is written") and besides, how would we implement it? .to_csv() ultimately uses the csv module, which doesn't support that. The only way to implement csv.QUOTE_NOTNULL currently is to pass None for any nulls.

[I posted a comment like this before but just noticed another problem.]