Pandas DOC: Set value for undefined variables in examples

xref #22900

Ideally we'd like to make all the examples in our documentation runnable. Seems like in some cases we're using variables without defining them. We should fix them if they are typos, or unimported modules (except np and pd), or we should provide a value if it's just a variable we didn't define.

We can get the list of problems with the next command:

$ flake8 --doctest --ignore="" --select=F821 pandas/ | grep -v "F821 undefined name 'pd'" | grep -v "F821 undefined name 'np'"
pandas/core/apply.py:1339:9: F821 undefined name '_relabel_result'
pandas/core/apply.py:1339:33: F821 undefined name 'func'
pandas/core/arrays/datetimelike.py:262:13: F821 undefined name 'self'
pandas/core/dtypes/base.py:266:27: F821 undefined name 're'
pandas/core/generic.py:2247:17: F821 undefined name 'df'
pandas/core/generic.py:5890:13: F821 undefined name 'func'
pandas/core/generic.py:5890:18: F821 undefined name 'g'
pandas/core/generic.py:5890:20: F821 undefined name 'h'
pandas/core/generic.py:5890:22: F821 undefined name 'df'
pandas/core/generic.py:5890:32: F821 undefined name 'a'
pandas/core/generic.py:5890:41: F821 undefined name 'b'
pandas/core/generic.py:5890:49: F821 undefined name 'c'
pandas/core/generic.py:5894:14: F821 undefined name 'df'
pandas/core/generic.py:5894:22: F821 undefined name 'h'
pandas/core/generic.py:5895:22: F821 undefined name 'g'
pandas/core/generic.py:5895:30: F821 undefined name 'a'
pandas/core/generic.py:5896:22: F821 undefined name 'func'
pandas/core/generic.py:5896:33: F821 undefined name 'b'
pandas/core/generic.py:5896:41: F821 undefined name 'c'
pandas/core/generic.py:5903:14: F821 undefined name 'df'
pandas/core/generic.py:5903:22: F821 undefined name 'h'
pandas/core/generic.py:5904:22: F821 undefined name 'g'
pandas/core/generic.py:5904:30: F821 undefined name 'a'
pandas/core/generic.py:5905:23: F821 undefined name 'func'
pandas/core/generic.py:5905:43: F821 undefined name 'a'
pandas/core/generic.py:5905:51: F821 undefined name 'c'
pandas/io/formats/style.py:1873:25: F821 undefined name 'ret'
pandas/io/formats/style_render.py:1279:49: F821 undefined name 'upper'

Comment From: kkangs0226

take

Comment From: MarcoGorelli

Hey

Wouldn't fixing EX02 also end up fixing these?

Comment From: datapythonista

Yes, good point, thanks @MarcoGorelli

I'd say that fixing the errors here will also help fix the docstring validations EX02 and EX03. I created another issue with the remaining errors in the examples. I think we can add the validation once both are completed.

@kkangs0226 I created two issues so two people could work in parallel, since these issues aren't trivial and will take some time. Please don't assign two issues to yourself at the same time, so other people can work on the one you aren't working yet.do you mind releasing the one you aren't workimg in yet? Thank you!

Comment From: kkangs0226

@datapythonista Got it, I was looking at both issues because they looked related. I will release this issue for other developers to work on it!

Comment From: datapythonista

@datapythonista Got it, I was looking at both issues because they looked related. I will release this issue for other developers to work on it!

No problem at all. And surely feel free to work in both, but I think they' may take longer than what it seems, so better to start with one leaving the other free in case someone else is available to work on it in the meantime. Thanks for helping with this!

Comment From: anshuman0123

assign me please . I am new to github

Comment From: DeaMariaLeon

Ideally we'd like to make all the examples in our documentation runnable.

@datapythonista almost all the examples are runnable, as we have been working a lot to remove EX02 errors. There are only about 13 functions that still need fixing. As flake8 errors go, I think we have to avoid the extra parentheses that have been added to a few functions. On a side note, flake8 throws "PDF023 found assignment to single-letter variable". But the documentation guidelines tell us to use single-letter variables.

Comment From: MarcoGorelli

that's a good point - the sentence

When illustrating examples with a single Series use the name s

should probably be updated to

When illustrating examples with a single Series use the name ser

Regarding the "PDF023" - this one is from a flake8 extension which we're no longer using, having moved to ruff - to remove, you could either recreate your environment, or just do

pip uninstall pandas-dev-flaker -y

(or the mamba equivalent)

Comment From: DeaMariaLeon

@MarcoGorelli just remember that the letter "s" is all over the place.. I don't know how many examples there are with this, but it seems like a colosal change. Also, if we remove the PDF023, does it mean we don't need to worry about this error, or does ruffcomplain also about it?

Comment From: MarcoGorelli

sure, maybe it's too large to change everywhere, but for new docstrings we could change the advice

and no, ruff doesn't complain about it (though technically flake8 is still used for docstrings, but without extra plugins installed)

Comment From: DeaMariaLeon

Updated: Numpy, scikitlearn etc also have single letter names in the docstrings..

Also, I don’t know where on github to keep talking about this, as the same issue was opened several times. With the PR #51634 to remove the parentheses, I only modified one of the functions of the original PR #51356. When I locally test the other functions that were modified (with #51356), there are still flake8 errors. Plus they also have the extra parentheses - that could confuse beginners.

I understand that just one type of flake8 error needed to be fixed, as instructed in the original issue. But since we are not testing EX03 yet, the rest of flake8 errors didn't come up I think.

I’m attaching the errors I get. I'm not sure if I should open an issue.

PR51356 errors.txt

Comment From: MarcoGorelli

thanks for flagging this - I think https://github.com/pandas-dev/pandas/pull/51356 was an improvement because it took us from invalid syntax such as

df.style.format(na_rep='MISS', precision=1, subset=[0])
    .format(na_rep='PASS', precision=2, subset=[1, 2])

to valid syntax such as

(df.style.format(na_rep='MISS', precision=1, subset=[0])
    .format(na_rep='PASS', precision=2, subset=[1, 2]))

Maybe these can be made more legibile, you're right, e.g.

df.style.format(na_rep="MISS", precision=1, subset=[0]).format(
    na_rep="PASS", precision=2, subset=[1, 2]
)

, if you wanted to open an issue about that (or directly make a PR), that'd be welcome

And if you wanted to open an issue about EX03 (similar to the current EX02/EX01 ones, which are gradually getting addressed - nice 🎉 ), that would also be welcome

Comment From: DeaMariaLeon

I will. Thanks for the explanations.

Comment From: MarcoGorelli

I think Marc makes a good point here that's it's not really worth changing

In general for these kinds of changes, it's probably more sustainable to either make changes as suggested by some automated tool, or to leave them, else it's a matter of taste and we end up ping-ponging between preferences

Thanks anyway for your comments and for engaging in PR review though, much appreciated!