Pandas version checks
- [X] I have checked that this issue has not already been reported.
- No issues mention
QUOTE_NOTNULL
- No issues mention
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import csv
import pandas as pd
df = pd.DataFrame(
[[0, None, np.nan, pd.NA]],
columns=['a', 'b', 'c', 'd'])
print(df.to_csv(quoting=csv.QUOTE_NOTNULL, index=False))
Issue Description
All fields are quoted, even nulls.
"a","b","c","d"
"0","","",""
Expected Behavior
All nulls should be unquoted.
"a","b","c","d"
"0",,,
Another possibility I might have expected is only unquoting None
, as described in the csv
docs, but this doesn't make sense for Pandas IMHO since Pandas has multiple "NULL"s.
"a","b","c","d"
"0",,"",""
Installed Versions
INSTALLED VERSIONS
------------------
commit : b1c2ba793bb1f3d7001de41ca4c73377006b9e4d
python : 3.12.7
python-bits : 64
OS : Linux
OS-release : 5.4.0-200-generic
Version : #220-Ubuntu SMP Fri Sep 27 13:19:16 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_CA.UTF-8
LOCALE : fr_CA.UTF-8
pandas : 3.0.0.dev0+1729.gb1c2ba793b
numpy : 2.1.0.dev0+git20240403.e59c074
dateutil : 2.9.0.post0
pip : 24.2
Cython : None
sphinx : None
IPython : 8.22.2
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pytz : 2024.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
Comment From: asishm
Thanks, looks like csv.QUOTE_NOTNULL
was added in python 3.12 along with csv.QUOTE_STRINGS
. The user guide at https://pandas.pydata.org/docs/user_guide/io.html#quoting-compression-and-file-format also only mentions 0,1,2,3
quotingint or csv.QUOTE_ instance, default 0 Control field quoting behavior per csv.QUOTE_ constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
Comment From: rhshadrach
Contributions implementing QUOTE_NOTNULL are welcome!
Comment From: asishm
@rhshadrach should a separate issue be opened for QUOTE_STRINGS ?
Comment From: rhshadrach
I think it's okay either way - fine with having both be this issue. It's okay for a PR to only implement one, it just won't fully close this issue.
Comment From: snitish
@wjandrea it seems QUOTE_NOTNULL (and other QUOTE_*) are already implicitly supported. They are passed through as-is to the csv module's writer. The problem in your case is that there is a na_rep
argument in to_csv
that defaults to ''
(empty string), so the NA
s are being replaced with empty strings and QUOTE_NOTNULL is not taking effect. I explicitly set na_rep
to None
-
>>> print(df.to_csv(quoting=csv.QUOTE_NOTNULL, index=False, na_rep=None))
"a","b","c","d"
"0",,,
and it worked as expected. In light of this, is there anything we really need to fix here? @rhshadrach @asishm
Comment From: rhshadrach
I explicitly set
na_rep
toNone
-
Is this tested? The docstring type-hints na_rep
as a str
, so if we are supporting None
it should be updated. This seems quite useful, so I'm supportive of allowing None
.
Comment From: wjandrea
@rhshadrach I think it would make more sense to have to_csv()
ignore the na_rep
parameter if quoting == csv.QUOTE_NOTNULL
and instead use None
internally, something like this:
def to_csv(
...,
na_rep: str = ''
):
...
_na_rep: str | None = na_rep
if quoting == csv.QUOTE_NOTNULL:
_na_rep = None
formatter = DataFrameFormatter(
...
na_rep=_na_rep,
)
...
Comment From: rhshadrach
In general, I'm negative on us ignoring values specified by users. Isn't that what your suggested approach is doing?
Comment From: wjandrea
@rhshadrach Yes it ignores the na_rep
argument, but if the user specifies csv.QUOTE_NOTNULL
, what else would they expect? I mean, the csv
docs say:
csv.QUOTE_NOTNULL
... quote all fields which are notNone
. ... if a field value isNone
an empty (unquoted) string is written.
Comment From: snitish
I explicitly set
na_rep
toNone
-Is this tested?
I did a quick grep and couldn't find any tests that set na_rep=None
. Agree that supporting (officially) na_rep=None
might be cleaner than ignoring na_rep
when the user specifies csv.QUOTE_NOTNULL
.
Comment From: rhshadrach
@rhshadrach Yes it ignores the
na_rep
argument, but if the user specifiescsv.QUOTE_NOTNULL
, what else would they expect? I mean, thecsv
docs say:
The pandas API often goes beyond the features of Python, and this can cause conflicts. In this particular case pandas allows specifying the NA representation while I believe Python does not (correct me if this is wrong). In such cases, I do not think it is always best to follow the Python API to the letter. At times, disagreeing with Python behavior is appropriate.
Some options:
- Ignore
na_rep
whencsv.QUOTE_NOTNULL
is specified. - Raise when
na_rep
is notNone
whencsv.QUOTE_NOTNULL
is specified. - Change the default of
na_rep
tolib.no_default
, making the real default ofna_rep
dependent on other arguments (None
whencsv.QUOTE_NOTNULL
is specified,""
otherwise). - Keep the current default and behavior of
na_rep=""
, making a user specifyna_rep=None
andcsv.QUOTE_NOTNULL
to get the same behavior as Python's CSV module.
cc @pandas-dev/pandas-core
Comment From: Dr-Irv
I'd vote for option 2:
Raise when na_rep is not None when csv.QUOTE_NOTNULL is specified.
Comment From: wjandrea
@rhshadrach Ah, I see what you mean.
How about a combination of options 2 and 3 to allow the user to specify None
explicitly or implicitly? i.e. if the user specifies csv.QUOTE_NOTNULL
, then for na_rep
:
- unspecified: use None
- None
: continue as normal
- any other value: raise
In code:
def to_csv(
...,
na_rep: str | None | lib.NoDefault = lib.no_default
):
...
_na_rep: str | None
if quoting == csv.QUOTE_NOTNULL:
if na_rep is lib.no_default or na_rep is None:
_na_rep = None
else:
raise ValueError('For "quoting=csv.QUOTE_NOTNULL", "na_rep" must be "None" or unspecified')
else:
_na_rep = '' if na_rep is lib.no_default else na_rep
formatter = DataFrameFormatter(
...
na_rep=_na_rep,
)
...
Comment From: rhshadrach
How about a combination of options 2 and 3...:
- ...
- any other value: raise
It seems to me this would be raising unnecessarily. Why is that preferred over adhering to the value the user specified?
Comment From: wjandrea
@rhshadrach If the function adhered to the value the user specified, what would it do with it? As far as I can figure, if the user specifies csv.QUOTE_NOTNULL
, the only reasonable value for na_rep
is None
.
Comment From: rhshadrach
Why isn't
df = pd.DataFrame([[0, None, np.nan, pd.NA]], columns=['a', 'b', 'c', 'd'])
print(df.to_csv(na_rep="x", index=False, quoting=csv.QUOTE_ALL))
# "a","b","c","d"
# "0",x,x,x
reasonable?
Comment From: wjandrea
@rhshadrach That's reasonable behaviour, but it goes beyond csv.QUOTE_NOTNULL
("an empty (unquoted) string is written") and besides, how would we implement it? .to_csv()
ultimately uses the csv
module, which doesn't support that. The only way to implement csv.QUOTE_NOTNULL
currently is to pass None
for any nulls.
[I posted a comment like this before but just noticed another problem.]