Pandas Add support for a pandasrc

Just using the existing configuration framework, but with a file format like matplotlib uses... See how they do it here: http://matplotlib.org/users/customizing.html.

Plus, we can document all the config options in a single file.

related #2452, #3046

Comment From: cpcloud

+1 for a matplotlibish format

Comment From: ghost

both ipython and python itself have existing startup file mechanisms in place, I'd still like to hear a good argument explaining what a .pandasrc file provides that those do not.

Comment From: jtratner

@y-p lets you distribute a project with settings options in the directory to make it work the way you expect and allow you to set something for one project that you aren't doing for others. You can't unilaterally overwrite or append to people's python startup files. This is a way to get granular formatting without needing to alter those things or even know where they are. And this doesn't prevent you from overriding those settings in a startup file.

Comment From: ghost

I see, so it's not about an addition to a user's dotfiles but about making a project self-contained. In that case, how would such a file improve on just putting some plain ol' initialization code in the source files?

Comment From: jtratner

@y-p requires imports or setting environment variables. It's also a pretty minor addition - what's below plus a bit of searching for the path to the config file:

import re

from pandas import set_option


def from_object(obj):
    # mappings expose items(); otherwise iterate keys and index into obj
    if hasattr(obj, 'items'):
        for k, v in obj.items():
            set_option(k, v)
    else:
        for k in obj:
            set_option(k, obj[k])


def from_file(path_or_buf):
    option_splitter = re.compile(r'\s*[:=]\s*').split
    # accept either a path or an already-open file-like object
    f = open(path_or_buf) if isinstance(path_or_buf, str) else path_or_buf
    errors = []
    for i, line in enumerate(f, 1):
        # allow for comments
        line = line.split('#')[0].strip()
        if line:
            try:
                split = option_splitter(line)
                if len(split) == 2:
                    option, value = split
                    # values are passed through as strings, unconverted
                    set_option(option, value)
                else:
                    raise ValueError("Malformed option")
            except (KeyError, ValueError) as e:
                errors.append("%d: %s" % (i, e))
    print(errors)

Comment From: jtratner

and clearly better errors, etc.

Comment From: jtratner

Plus, this separates config options from actual code, which is a net gain.

Comment From: cpcloud

Personally, I'd rather set some config file than have a bunch of calls to pd.set_option() in a startup file. And like @jtratner says, supporting different configs for different projects by, e.g., reading a single file from the current directory is easier to keep track of than a bunch of calls to pd.set_option().

Comment From: ghost

Nope, initialization code does not require either envars or imports. Don't see how the errors would be clearly better. If they are, why not improve the existing error messages? If you want to separate config options from actual code, use another source file, hey presto. Configuration is code last time I checked, and seems to be elsewhere as well. I don't think I've worked on a project that does not have some sort of settings.py, that's familiar to most people (certainly django folk).

I don't follow what @cpcloud means by "keeping track of" at all. configuration is just code. I don't think that holds up.

re verbosity of pd.set_option. yep. big deal. How many LOC are we talking? meh.

Comment From: cpcloud

> Nope, initialization code does not require either envars or imports.

I don't see how that's true re imports.

> I don't think I've worked on a project that does not have some sort of settings.py

Lucky you. I have, and it's not fun.

> I don't follow what @cpcloud means by "keeping track of" at all.

I just meant that I'd rather have a single config file per project than having to copy paste pd.set_option() and tweak the calls, but a settings.py-like file works just fine too. It's not a big deal, I actually use the defaults for almost everything so this doesn't really bother me.

And that's about all I'll say on that.

Comment From: jtratner

okay, that's fine. I'd like to add from_object (to allow you to create a dict and set a bunch of options at once) and make options support __getitem__ and __setitem__ just to clean things up.

In other words, allows you to do this (which is nearly equivalent to what the pandasrc would do in syntax, etc):

config.from_object({
    'io.excel.xlsx.writer': 'openpyxl',
    'display.max_rows': 80,
})

and this

config.options['io.excel.xlsx.writer'] = 'openpyxl'
config.options['display.max_rows'] = 80

Is that okay? feels cleaner to me than a series of function calls.
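
A rough sketch of the from_object idea, assuming only that some setter with the set_option(name, value) shape is available. The names apply_options and setter are made up for illustration and are not part of pandas; with pandas installed, pd.set_option could be passed as the setter.

```python
from typing import Any, Callable, List, Mapping, Tuple


def apply_options(options: Mapping[str, Any],
                  setter: Callable[[str, Any], None]) -> List[Tuple[str, str]]:
    """Apply every name -> value pair through `setter`, collecting failures.

    `apply_options` and `setter` are hypothetical names used for this sketch.
    """
    errors = []
    for name, value in options.items():
        try:
            setter(name, value)
        except (KeyError, ValueError) as exc:
            # Keep going so one bad option does not hide the others.
            errors.append((name, str(exc)))
    return errors


if __name__ == "__main__":
    applied = {}
    problems = apply_options(
        {"io.excel.xlsx.writer": "openpyxl", "display.max_rows": 80},
        applied.__setitem__,  # stand-in for pd.set_option
    )
    print(applied, problems)
```

Collecting errors instead of raising on the first one is a design choice here: a single misspelled option name then does not prevent the remaining options from being applied.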

Comment From: jreback

looks ok by me....only request I have is to make setting the docs for the option easier (not sure how, as you generally need/want a multi-line string, so you end up creating a variable to hold it.....)

Comment From: ghost

Yeah, I like that too. The python logging module supports a from_dict for configuration, which comes in handy in that situation. I find I need to change options so rarely I don't miss this in practice, but if you feel there's a need then from_dict is a good way to do it.

re supporting set/getitem, You can already get/set values directly, e.g. display.foo = 1. "Cleaner" is partly a matter of taste and that form isn't more concise than the existing set_option mechanism, nor more convenient than the existing options. way of doing it which also provides tab-completion. Do you feel strongly that adding a 3rd way to do the exact same thing is worth it?
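
For reference, the dict-based entry point in the standard library is logging.config.dictConfig (the name from_dict above reads as a paraphrase). A small sketch of that style of configuration follows; the logger name "demo" and the format string are chosen only for illustration.

```python
import logging
import logging.config

# Dict-based configuration as accepted by the standard library's logging module.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"brief": {"format": "%(levelname)s:%(name)s:%(message)s"}},
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "brief",
            "level": "INFO",
        }
    },
    "loggers": {"demo": {"handlers": ["console"], "level": "INFO"}},
}

logging.config.dictConfig(LOGGING)
logger = logging.getLogger("demo")
logger.info("configured from a dict")
```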

Comment From: jtratner

Didn't realize it supports setattr.


Comment From: jreback

here's a perfect case for this, #2612

Comment From: jreback

push to 0.14?

Comment From: jtratner

Sure, but are we even doing this anymore?


Comment From: jreback

can certainly close? didn't you have a use case for it?

Comment From: jtratner

it would be useful to me, but @y-p's point that it's not necessary seems reasonable too.

Comment From: jreback

ok move to someday or 0.14 for revisiting

Comment From: jtratner

whichever.


Comment From: jreback

@jtratner did you move this to 0.13?

Comment From: jtratner

I've put this for 0.13, because the feedback from SettingWithCopy suggests that there are a non-trivial number of people who are going to want to be able to shut off the copy warnings automatically.

Comment From: jtratner

shouldn't take too long to do either.

Comment From: jreback

and how is this different than just doing pd.set_option('chained_assignment', None)....so far I saw exactly 1 comment on that from bigbug; and IMHO he should keep it on... my2c

Comment From: jtratner

it's not, it just means you can turn it off for old scripts. There have been one or two issues posted on pandas about this as well - e.g. #5597

Comment From: jreback

users who want to turn off the warnings almost certainly have an ipython startup script already. I just think a pandasrc is pretty duplicative IMHO. The point of the warning is mostly for new users in any event.

Comment From: jtratner

okay, pushed to someday again

Comment From: ghost

The python/ipython startup files are less known than we might think, and there's only a slight btw in the FAQ. I'll open a doc issue for 0.14.

Comment From: cbrnr

I'd like to revive this issue. A pandasrc file would be very useful for people that do not want to import pandas in their IPython startup script.

Comment From: t-makaro

I'd like to have a configuration file for pandas. I hate having to always place:

pd.set_option('display.latex.repr', True)
pd.set_option('display.latex.longtable', True)

at the start of my notebook especially since I always have to look it up.

Using an ipython startup file is not a solution since I don't want pandas to always import (I also don't want to hide the import. I just want to hide the settings that I use to export to pdf).

Comment From: benpayne

I am looking into implementing this feature. I've reviewed the mentioned guidelines for implementing it based on matplotlibrc. It searches a few spots for the rc file that mostly make sense, but with a few exceptions.

Steps to find the config file.
1. Check the local directory: Seems like a good idea for a project to override settings.
2. Check an env variable (MATPLOTLIBRC) and, if it exists, look for a file at $MATPLOTLIBRC/matplotlibrc: Seems redundant to look for a file at that path. Why not just have the env variable point to the file? This would allow you to have several RC files in the same directory and just change your env to get different behavior.
3. Look at the user's home dir and find an rc file there: No changes to this, makes sense.
4. Check an install file location that will be overwritten every time the package is installed: This seems like a bad idea. I think an example RC file in this location makes sense, something people can copy and modify for themselves, and something that has comments documenting all the options. But making this a file that has to be parsed for every user that imports pandas seems excessive. Furthermore, if developers want to change the defaults of various options, they should change them in the code, not in a global config file. So I am planning to drop this step.

Please let me know if I am missing anything in this analysis.

Another question that comes to mind is whether we should search for the first RC file and stop, or whether this should be a bottom-up stack approach. Basically, would a user like to have global settings in their home dir for all projects and then have the ability for a local project to override some settings without replacing the global settings? Or would this get confusing? I'd implement that by parsing every RC file we find (3, 2, 1) and then calling pd.set_option for each setting in the files. That way, if the same setting was set in two files, the last file parsed would be the setting we run with. I'd like to hear others' thoughts on this.
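
The lookup order and the stacked override described above could be sketched roughly as follows. Everything named here is an assumption for illustration: neither a PANDASRC environment variable nor the file names pandasrc and .pandasrc exist in pandas.

```python
import os
from pathlib import Path
from typing import Dict, List, Mapping, Optional


def candidate_rc_files(cwd: Path, home: Path,
                       env: Optional[Mapping[str, str]] = None) -> List[Path]:
    """Return existing rc files ordered from lowest to highest priority.

    Hypothetical names: a `PANDASRC` variable pointing directly at a file,
    `~/.pandasrc` for user-wide settings, `./pandasrc` for a project.
    """
    env = os.environ if env is None else env
    candidates = [home / ".pandasrc"]             # step 3: user-wide defaults
    if env.get("PANDASRC"):
        candidates.append(Path(env["PANDASRC"]))  # step 2: explicit file
    candidates.append(cwd / "pandasrc")           # step 1: project-local file
    return [p for p in candidates if p.is_file()]


def merge_layers(layers: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge parsed files; later layers override earlier ones key by key."""
    merged: Dict[str, str] = {}
    for layer in layers:
        merged.update(layer)
    return merged
```

Parsing the files in this order (3, 2, 1) and merging means a project file only needs to list the settings it changes; the alternative of stopping at the first file found would amount to returning just the last element of the list.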

Comment From: TomAugspurger

I've been thinking about this again in the context of extension arrays.

I suspect we'll want config options to enable things like integer-na by default, apache arrow memory by default, etc. It'd be nice if we had a more robust config system before those land.

There's a lot of prior art with designing configuration systems. It'd be good to collect the strengths and weaknesses of those before we go off and do our own.


Comment From: benpayne

I agree that it would be nice to leverage something existing. The config file format proposed is not very standard (key: value). It is simple, but that could be a weakness or a strength down the road. As for parsing this with standard packages, python configparser won't work, shlex could work, but a simple line by line split on ":" would probably be the easiest way to parse this. Today the option system is designed to only handle a single value for any option. So the "key -> value" paradigm is good for our use. Unless you are envisioning that this will change soon?

I'm a fan of using JSON for config files. Parsing is easy, it's very flexible down the road and easy to understand no matter how experienced you are with it.

Is there some specific "prior art" you had in mind that I could look at to evaluate for our uses?
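
A minimal sketch of the line-by-line "key: value" parsing mentioned above, using only the standard library. The function name parse_rc_lines is hypothetical, and converting values with ast.literal_eval is an assumption of this sketch rather than anything proposed in the thread.

```python
import ast
import re
from typing import Any, Dict, Iterable, List, Tuple

_SPLIT = re.compile(r"\s*[:=]\s*")


def parse_rc_lines(lines: Iterable[str]) -> Tuple[Dict[str, Any], List[str]]:
    """Parse `key: value` lines; return (options, error messages)."""
    options: Dict[str, Any] = {}
    errors: List[str] = []
    for lineno, raw in enumerate(lines, 1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = _SPLIT.split(line, maxsplit=1)
        if len(parts) != 2 or not parts[0] or not parts[1]:
            errors.append("line %d: malformed option %r" % (lineno, line))
            continue
        key, text = parts
        try:
            # Turn "80", "True" or "None" into Python values.
            value = ast.literal_eval(text)
        except (ValueError, SyntaxError, TypeError):
            value = text  # keep bare words such as openpyxl as strings
        options[key] = value
    return options, errors
```

One limitation of this sketch: a "#" inside a value is treated as the start of a comment, so values containing that character would need quoting rules that are not handled here.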

Comment From: TomAugspurger

> I'm a fan of using JSON for config files. Parsing is easy, it's very flexible down the road and easy to understand no matter how experienced you are with it.

I'm slightly against JSON for anything that's supposed to be edited by hand (as I suspect this config would be).

> Is there some specific "prior art" you had in mind that I could look at to evaluate for our uses?

Any Python library with a config system, as each has likely implemented their own :) Django, IPython, Flask, Dask

Comment From: t-makaro

The Jupyter Project developed Traitlets, which handles their configuration system. Their config files are regular python files. Which is super awesome because I can use 1 config file that adapts nicely to multiple different systems that I manage.

One of the problems that I noticed with IPython startup files is that the startup file ran on any IPython environment that I used including ones that didn't have pandas installed. Something like a Pandas startup file that only loads on importing Pandas could work.

Comment From: benpayne

I've created a tool to dump the currently supported options in the file format proposed by @jtratner. See the attachment for this. So looking over the options that exist today, most are bool, int or strings. However, one is a callback function (display.float_format). To actually support this in a config file, the file would probably have to be python, like Jupyter. That certainly creates a powerful system. But the fact that someone could put any code in that file could create some interesting side effects.

The concern I have about something like Traitlets is that it overlaps with code that has already been created in pandas to register options, provide callbacks when they change, and support a framework for validators. It would also require reworking code around each of the nearly 50 options that are supported today. However, it would take these options from a centralized place and put them in the code that uses them. Always good to eliminate centralization in code bases...
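
To make "centralized" concrete, here is a toy registry with defaults, validators and change callbacks. It is a hypothetical illustration of the general pattern only; the class and function names are invented for this sketch and it is not pandas' actual implementation.

```python
from typing import Any, Callable, Dict, Optional


class OptionRegistry:
    """Toy centralized registry: one place holds values, validators, callbacks."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._validators: Dict[str, Optional[Callable[[Any], None]]] = {}
        self._callbacks: Dict[str, Optional[Callable[[str, Any], None]]] = {}

    def register(self, name, default, validator=None, on_change=None):
        if validator is not None:
            validator(default)  # a bad default should fail at registration
        self._values[name] = default
        self._validators[name] = validator
        self._callbacks[name] = on_change

    def set(self, name, value):
        if name not in self._values:
            raise KeyError("no such option: %r" % name)
        validator = self._validators[name]
        if validator is not None:
            validator(value)  # expected to raise ValueError on bad input
        self._values[name] = value
        callback = self._callbacks[name]
        if callback is not None:
            callback(name, value)

    def get(self, name):
        return self._values[name]


def is_positive_int(value):
    """Example validator: accept ints greater than zero, reject bools."""
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError("expected a positive int, got %r" % (value,))
```

In a decentralized design, each component would instead declare and own its configurable attributes, which is why switching approaches touches the code around every option.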

When starting this I was envisioning leaving that infrastructure in place and simply building a layer on top that reads settings from a config file and invokes the current, relatively robust, system for setting options. The fact is the feature could be as simple as adding code at startup to look for a config file (in python format) and load that file. The file itself could be a series of set_option calls. This is probably 10 lines of code and has minimal overhead, especially if no file is found.
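
A sketch of what "look for a config file in python format and load it" might look like, using runpy from the standard library. The function name load_python_rc and the idea of injecting set_option through init_globals are assumptions of this sketch, not an existing pandas mechanism.

```python
import runpy
from pathlib import Path
from typing import Any, Dict, Optional


def load_python_rc(path: Path,
                   init_globals: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Execute a Python-format rc file if it exists and return its globals.

    Returns an empty dict when no file is found, so the overhead without a
    config file is a single filesystem check.
    """
    if not path.is_file():
        return {}
    # run_path executes arbitrary code, so the file has to be trusted.
    return runpy.run_path(str(path), init_globals=init_globals)
```

With pandas installed, a call such as load_python_rc(rc_path, {"set_option": pd.set_option}) would let the file consist of plain set_option(...) lines, and because the file is ordinary Python it could also define a callable for an option like display.float_format.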

I've worked with Django and Flask before. They both use python files for configuration. I'm not sure under the covers if they are doing something centralized like we are today or something more like Traitlets. IPython uses Traitlets from what I've read. Dask I briefly looked into and learned that the file format proposed in this issue (key: value) is called YAML and there is a project that supports this format (PyYAML).

So it seems the design decisions are coming down to:

File Format: YAML vs Python
Internal Storage: Centralized vs Decentralized.

If the consensus is to go with a decentralized approach like Traitlets then this will be a much bigger change to the codebase. If that is the case we might want to separate this into two issues: building out the config file as this issue describes, and then a larger task to rework the storage of options in a decentralized manner.

pandasrc.txt

Comment From: jbrockmendel

Discussed this on today's dev call and the consensus was mostly-negative. @MarcoGorelli had some points about inevitable feature creep into people wanting overrides in command-line and in pyproject.toml files. If a champion steps up to implement+maintain this we can reconsider, but for now I'm closing as no action.