TODO

  • [ ] dicts with many keys should be summarized (Update: I can no longer remember the use case for dicts printed via pprint_thing).
  • [ ] summarization should match numpy's [a a a ... b b b], not [a a a ...]
  • [ ] add np.set_printoptions(edgeitems) analogue.

- [ ] options.display.max_seq_items should have a default value != None #3391, #5120, #5629
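For reference, the numpy-style summarization the TODO items refer to is controlled by `threshold` and `edgeitems` in numpy itself; a minimal illustration of the target behavior:

```python
import numpy as np

# numpy shows `edgeitems` items at each end once the array exceeds `threshold`
arr = np.arange(1000)
print(np.array2string(arr, threshold=10, edgeitems=3))
# -> [  0   1   2 ... 997 998 999]
```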

via https://github.com/pydata/pandas/pull/5753

Comment From: cpcloud

@y-p i can take this if you want

Comment From: ghost

please.

Comment From: d1manson

pprint_thing is REALLY slow when used with numpy arrays. For example, a DataFrame holding 100 masked 100x100 arrays takes about 10 seconds to print, which is really irritating.

import numpy as np
import pandas as pd

some_data = np.ma.array(np.random.rand(100, 100), mask=np.random.rand(100, 100) > 0.2)
df = pd.DataFrame(dict(example=[some_data] * 100))
print(df)

Is there some scope for calling str(ndarray) somewhere in pprint_thing, and perhaps simply doing .replace('\n',',')? Obviously, it would need to work within the recursive framework of pprint_thing, but wouldn't implement any special recursion itself (i.e. a tuple of ndarrays should be iterated over by pprint_thing and have their numpy __str__ methods invoked).
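The suggestion above could be sketched as a standalone helper (a hypothetical function, not pandas' actual pprint_thing), reusing numpy's fast `__str__` and collapsing the newlines:

```python
import numpy as np

def format_ndarray(arr):
    """Render an ndarray on one line via numpy's own fast string
    conversion, replacing newlines with commas (hypothetical helper)."""
    return str(arr).replace('\n', ',')

a = np.arange(9).reshape(3, 3)
print(format_ndarray(a))
# -> [[0 1 2], [3 4 5], [6 7 8]]
```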

I'm still pretty new to pandas, so maybe I've missed something.

(For reference, I'm doing neuroscience: I have two or three "levels" of analysis, the topmost of which, i.e. the most meta-level, I would like to be doing in pandas, but I would very much like to store some of the lower-level stuff in a DataFrame alongside the meta-stuff.)

Comment From: jreback

What you are doing is extremely inefficient. Pandas (and numpy) are generally best used to hold a single scalar in a cell, which can be represented by a base type (e.g. a float).

Try this

In [8]: import numpy as np; from pandas import DataFrame

In [9]: df = DataFrame(np.random.randn(100,100))

In [10]: df = df.where(df>0.2)

If you need multiple levels, simply add a multi-index.
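The multi-index suggestion might look like the following sketch (the shape and level names are assumptions, chosen to match the 100 images of 100x100 pixels in the example above):

```python
import numpy as np
import pandas as pd

# flatten 100 images of 100x100 pixels into one long frame, one scalar
# per cell, indexed by (image, row) with pixel columns
idx = pd.MultiIndex.from_product([range(100), range(100)],
                                 names=['image', 'row'])
df = pd.DataFrame(np.random.randn(100 * 100, 100), index=idx)
print(df.loc[0].shape)  # pixels of the first image
```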

Changing a printing routine to handle this use case is not likely to happen, as it would add to the printing code's complexity (which is already pretty crazy).

Comment From: d1manson

I think I explained that rather poorly.

Imagine my 100 arrays are a list of images of random shapes, with each image having a multi-index tuple of (person, day, hour). I compute a bunch of metrics for each image and put the results as columns in my dataframe, while also putting the images themselves in a column. I can then do my meta-analysis on groups of the raw images and/or on the scalar-valued columns... I need to be able to do both. I have already submitted a pull request which makes it easyish to render numpy arrays as images in base64 data within html img tags, but the default display mechanism is still this slow pprint call.

Comment From: jreback

Well, if you really, really want to do this, I would simply wrap an object around the array and give it a custom printing method. Then you can do whatever you want. Pandas tries to do the right thing when printing nested things, but in this case you are putting something there that pandas cannot render cheaply.
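The wrapper idea could be sketched like this (the class name and summary format are made up for illustration):

```python
import numpy as np
import pandas as pd

class ArrayCell:
    """Wrap an ndarray so that pandas prints a cheap one-line summary
    instead of formatting the array's contents (hypothetical wrapper)."""
    def __init__(self, arr):
        self.arr = arr
    def __repr__(self):
        return 'ndarray%s' % (self.arr.shape,)

some_data = np.ma.array(np.random.rand(100, 100),
                        mask=np.random.rand(100, 100) > 0.2)
df = pd.DataFrame(dict(example=[ArrayCell(some_data)] * 100))
print(df.head())  # each cell renders via ArrayCell.__repr__
```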

Comment From: d1manson

yes, that had occurred to me, but it's not as convenient, and I'd hoped that this kind of blob-like usage was mainstream enough to merit a line or two in the right place!

Comment From: jreback

this is not the right way to harness the power of pandas

you are almost certainly better off keeping your images as numpy arrays or whatever (or frames) and simply having a reference to them (e.g. a string) in a particular column
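The reference-based layout might look like this sketch (the key scheme and lookup dict are assumptions; on-disk paths would work the same way):

```python
import numpy as np
import pandas as pd

# keep the images outside the frame, keyed by name; the frame holds
# only scalar metrics plus the key used to look each image up
images = {'img_%03d' % i: np.random.rand(100, 100) for i in range(100)}
keys = sorted(images)
df = pd.DataFrame({'image_key': keys,
                   'mean_intensity': [images[k].mean() for k in keys]})

# fetch an image only when it is actually needed
first = images[df.loc[0, 'image_key']]
print(first.shape)  # (100, 100)
```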

or use the object solution

you are trying to shoehorn a frame onto what you need, but it's not really suited for that

Comment From: d1manson

Shoehorning, maybe, but the result is actually quite good - I'd recommend it to anyone else doing a similar kind of analysis. And, strictly speaking, I think what is being stored in the DataFrame is a reference - it's a reference to the ndarray instance, right? It's not as though the DataFrame has the ndarray's data stored contiguously within it?

Comment From: WillAyd

Closed as ambiguous