This code
df=pd.DataFrame([['שלום','עליכם', 'מלאכי השלום', 'ברוכים הבאים']])
df.columns=['A','B','C','D']
df
does this:
See that the data row is close to the right?
A partial workaround is to prepend the Unicode left-to-right mark (U+200E) to each row:
print u"\n\u200e".join(unicode(df).split("\n"))
but if you look you can see the data appears in reverse order, since:
In [49]: print df.iloc[0,0]
שלום
The same occurs even if the text is decoded into unicode strings.
I tried to understand the repr code in pandas but it's too complex for me to follow, full of special cases. I hope the developers can fix this.
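For readers on Python 3, the same per-line workaround can be sketched without pandas. This is only an illustration, not pandas code: `prefix_lines_with_lrm` and the sample `table` string are made-up names standing in for `str(df)`, and unlike the one-liner above this version also marks the first line.

```python
LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK

def prefix_lines_with_lrm(text):
    """Return text with an LRM inserted at the start of every line."""
    return "\n".join(LRM + line for line in text.split("\n"))

# Hypothetical stand-in for str(df); Hebrew is written as escapes so the
# source is not visually reordered by the editor.
table = "   A    B\n0  \u05e9\u05dc\u05d5\u05dd  \u05e2\u05dc\u05d9\u05db\u05dd"

marked = prefix_lines_with_lrm(table)
print(marked)
```

Whether the mark changes what you see depends on the terminal or notebook front end doing the rendering, so treat this as something to try rather than a guaranteed fix.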
Comment From: cpcloud
I don't think this is a pandas issue, because just using a list of the characters above yields the same behavior:
In [35]: x = ['שלום','עליכם', 'מלאכי השלום', 'ברוכים הבאים']
In [57]: x[0] == 'שלום'
Out[57]: True
Comment From: cpcloud
xref: https://github.com/ipython/ipython/issues/5960
Comment From: cpcloud
I think this should be tracked at IPython because this isn't a pandas issue.
Comment From: David-Alkalai
What "same behavior" do you mean, exactly?
The problem is not that x[0] == 'שלום'. I included that only to show that even though the data value is in location (0,0) in the dataframe, it appears in the 4th column in the repr when using the unicode LTR trick I suggested.
ipython/ipython#5960, which I opened earlier today, is about a separate issue having to do with RTL support.
Comment From: cpcloud
What "same behavior" do you mean, exactly?
By "same behavior" I mean that the slice semantics are the same for list, ndarray, Series and DataFrame objects and probably any other sequence. There might be some kind of global modification to __getitem__.
The problem is not that x[0] == 'שלום'. I included that only to show that even though the data value is in location (0,0) in the dataframe, it appears in the 4th column in the repr when using the unicode LTR trick I suggested.
Check out some operations on the Series and the underlying ndarray, they're reversed:
In [96]: original = pd.Series(['שלום','עליכם', 'מלאכי השלום', 'ברוכים הבאים'])
In [97]: df = DataFrame([original.values])
In [98]: print(df)
0 1 2 3
0 שלום עליכם מלאכי השלום ברוכים הבאים
In [99]: print(df.loc[0])
0 שלום
1 עליכם
2 מלאכי השלום
3 ברוכים הבאים
Name: 0, dtype: object
In [100]: values = original.values
In [101]: map(lambda x: print(x.decode('utf8')), values)
שלום
עליכם
מלאכי השלום
ברוכים הבאים
Out[101]: [None, None, None, None]
This is not something wrong with pandas; it looks like a feature of IPython. Your LTR trick works for the repr, but doesn't work for the iteration most likely because they're two separate mechanisms for dealing with RTL locales. I'm speculating a lot here, but none of these issues is something wrong with pandas. I'm happy to be proven wrong, just need an example that's specific to pandas.
Comment From: David-Alkalai
By "same behavior" I mean that the slice semantics are the same for list, ndarray, Series and DataFrame objects and probably any other sequence.
Yes, what about it?
There might be some kind of global modification to getitem.
I don't understand exactly what you mean by this. If you suspect that ipython changes the meaning of slice indexes if the contents are RTL text... obviously that's not a reasonable assumption.
Whether it's specific to pandas or not depends on your point of view. You could change IPython or pandas to fix it (maybe). I was suggesting that pandas make use of facilities from the Unicode specification to ensure the repr is displayed correctly.
If that's "specific to pandas" enough, I can't say.
The issue I think is that a dataframe is tabular in nature, but the way the text repr is constructed it gets rendered like a text string and unicode bidi kicks in where it shouldn't. So I'm suggesting pandas might want to be explicit about it.
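One way to see why bidi kicks in is to look at the bidirectional class the Unicode database assigns to each character, using the standard library's `unicodedata.bidirectional`. A rough sketch (the helper name `bidi_classes` is made up for illustration):

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

hebrew = "\u05e9\u05dc\u05d5\u05dd"  # the first cell of the example, in logical order
header = "A 0"                       # a Latin letter, a space, a digit

hebrew_classes = bidi_classes(hebrew)
header_classes = bidi_classes(header)

print(hebrew_classes)  # strong right-to-left letters
print(header_classes)  # strong left-to-right, whitespace, European number
```

Spaces are neutral, so a renderer applying the bidi algorithm resolves the padding between two Hebrew cells as part of one right-to-left run. That is probably why a row of a text table gets laid out like a sentence instead of like separate cells.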
Your LTR trick works for the repr, but doesn't work for the iteration most likely because they're two separate mechanisms for dealing with RTL locales.
Sorry, I don't understand your point about iteration. What two mechanisms are they that you mean?
Comment From: cpcloud
Yes, what about it?
Simply answering your question.
I don't know why those slice semantics are like that, but that RTL iteration seemingly goes in the opposite direction for an arbitrary sequence smells a bit funny to my LTR centric brain. Again, I was simply pointing out that that's why it shows up like it does in the DataFrame repr. Lotta thats there. Sorry about that :smile:
You could change IPython or pandas to fix it, I was suggesting that pandas make use of facilities from the unicode specification to ensure the repr is displayed correctly.
We merge pull requests liberally around here, so if you'd like to put one up, we're happy to assist you in navigating the hell :fire: that is the repr-ing code. I know I've done :scream_cat: :arrow_right: :rage: many a time in the past when trying to grok that code.
I don't know what the unicode spec is for RTL languages, but it sounds like you do. I'd love to learn about it from you in the context of some real code that actually uses it.
What two mechanisms are they that you mean?
What I meant was that iterating from RTL doesn't change to iterating from LTR when you prefix the characters with an LTR character (not sure why I thought it should, but then I don't understand why it goes from RTL in the first place). So, just forget what I said above about the two mechanisms.
Comment From: cpcloud
Maybe I was a bit hasty in closing this.
Comment From: David-Alkalai
Thanks for reopening.
Really, there's no difference in how iteration or slicing work, as you suggest(ed). It's just the way the text is displayed. It's very reasonable that it would seem bizarre; Hebrew/Arabic/Farsi/Urdu speakers are just used to it.
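A quick way to confirm that only the display is affected is to inspect code points instead of rendered text. A small sketch, with the Hebrew written as escapes so the source itself is not reordered on screen (the variable names are illustrative):

```python
import unicodedata

# The first two cells from the example, stored in logical (typing) order.
words = ["\u05e9\u05dc\u05d5\u05dd", "\u05e2\u05dc\u05d9\u05db\u05dd"]

first_word = words[0]

# Code points are listed in storage order, whatever the terminal draws.
code_points = ["U+%04X" % ord(ch) for ch in first_word]
first_letter = unicodedata.name(first_word[0])

print(code_points)
print(first_letter)
```

Index 0 of the list is still the first word and index 0 of the string is still its first letter; the bidi reordering only happens when the text is drawn.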
We merge pull requests liberally around here
and use emoji!
Comment From: dsm054
Out of curiosity, do Hebrew tables normally have columns in RTL order? Would you typically see "3 2 1 0", or "dalet gimel bet alef", reading left to right on the page?
Comment From: David-Alkalai
Mostly, yes: Column order matches "Reading order".
Either way would be fine though, since you may have English column headings but Hebrew contents. But having some rows appear flush right and some left, and the data appearing under the wrong columns is of course not great.
Comment From: cpcloud
@David-Alkalai I just figured out that not only are the Hebrew characters encoded in RTL but so are the other parts of Python syntax. I think that was what was throwing me off.
Comment From: David-Alkalai
print "|".join([u'\u200e'+x.decode('utf8')
for x in ['שלום','עליכם', 'מלאכי השלום', 'ברוכים הבאים']])
Works. The Unicode marker has to prefix every single word rather than appearing just once at the start of the line.
So this is solved in principle. It still requires someone who understands all that pandas repr code to fix. I've done what I can to help.
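A Python 3 rendering of the same per-cell idea, as a sketch only. `join_cells_ltr` is a hypothetical helper, not something pandas provides, and the cells are written as escapes to keep the source unambiguous:

```python
LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK

def join_cells_ltr(cells, sep="|"):
    """Join cells with sep, putting an LRM in front of every cell."""
    return sep.join(LRM + cell for cell in cells)

cells = [
    "\u05e9\u05dc\u05d5\u05dd",
    "\u05e2\u05dc\u05d9\u05db\u05dd",
    "\u05de\u05dc\u05d0\u05db\u05d9 \u05d4\u05e9\u05dc\u05d5\u05dd",
    "\u05d1\u05e8\u05d5\u05db\u05d9\u05dd \u05d4\u05d1\u05d0\u05d9\u05dd",
]

row = join_cells_ltr(cells)
print(row)
```

The extra marks are invisible but are real characters, so any code that measures column widths with `len()` would need to account for them. That is one reason a fix inside the repr code is less trivial than this one-liner suggests.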
Comment From: GKjohns
Hello from 2019 🙋🏽♂️. Has anything been done to fix this issue? I'm running into something similar for Arabic
Comment From: jreback
@GKjohns a pull request to fix this would be welcome
Comment From: jorisvandenbossche
Just ran into a quite confusing issue debugging one of our tests:
In [23]: import pandas._testing as tm
In [32]: index = tm.makeUnicodeIndex(3)
In [33]: index
Out[33]: Index(['3נפ583עדני', 'צ2מןץעס730', 'ם2נ1ע3אט7פ'], dtype='object')
In [34]: s = pd.Series([0, 1, 2], index=index)
In [35]: s
Out[35]:
3נפ583עדני 0
צ2מןץעס730 1
ם2נ1ע3אט7פ 2
dtype: int64
In [36]: s.index
Out[36]: Index(['3נפ583עדני', 'צ2מןץעס730', 'ם2נ1ע3אט7פ'], dtype='object')
In [37]: s.values
Out[37]: array([0, 1, 2])
So the issue is also present for Series, and not only DataFrame, and not only in the QtConsole.
Comment From: chapmanjacobd
When using series.to_string(index=False, header=False) on a Series which has an RTL character somewhere in a single row, all the lines of the output become RTL. Maybe that kind of makes sense though; maybe not a bug, just weird and kind of annoying.
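A possible stopgap for this case is to post-process the string that `to_string` returns, marking every line only when some line contains right-to-left text. The sketch below works on a plain string so it runs without pandas; `contains_rtl` and `force_ltr_lines` are hypothetical names, and whether the mark has any effect is an assumption about the terminal in use.

```python
import unicodedata

LRM = "\u200e"             # U+200E LEFT-TO-RIGHT MARK
RTL_CLASSES = {"R", "AL"}  # strong right-to-left classes (Hebrew, Arabic letters)

def contains_rtl(text):
    """True if any character in text has a strong right-to-left class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)

def force_ltr_lines(text):
    """Prefix every line with an LRM, but only if the text contains RTL characters."""
    if not contains_rtl(text):
        return text
    return "\n".join(LRM + line for line in text.split("\n"))

# Stand-in for series.to_string(index=False, header=False) output:
# one Arabic row between two Latin rows.
sample = "abc\n\u0633\u0644\u0627\u0645\nxyz"
fixed = force_ltr_lines(sample)
print(fixed)
```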
Comment From: jbrockmendel
wait wait go back. does python itself change how it handles lists based on the presence of LTR characters? yikes.