Should pandas have a Symbol datatype for strings which are frequently re-used?
For example, consider a table of stock trades with columns ['Timestamp','Price','Size','StockSymbol']. Suppose there are 10 million trades, but only 500 different stock symbols. Instead of storing 10 million strings, we could fill the 'StockSymbol' column with integers 0-499 where 0 represents AAPL, 1 represents AMZN, etc.
The Symbol type in Ruby, Scala, and Q does this compression/expansion automatically. It's a big memory saver in big tables, and Symbol comparison can be much faster than string comparison.
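For concreteness, here is a minimal sketch (ticker values invented) of that integer mapping using `pd.factorize`, which returns the integer codes plus the lookup table of unique strings -- essentially the Symbol encoding described above:

```python
import pandas as pd

symbols = pd.Series(['AAPL', 'AMZN', 'AAPL', 'GOOG', 'AMZN', 'AAPL'])

# factorize() returns the integer codes and the table of unique strings
codes, uniques = pd.factorize(symbols)
print(codes)    # [0 1 0 2 1 0]
print(uniques)  # Index(['AAPL', 'AMZN', 'GOOG'], dtype='object')
```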
I often use `.astype('category')` for string compression in pandas. But at a talk I saw recently, @wesm suggested this might not be a great idea. I've definitely caused several confusing problems by abusing categoricals this way: comparing two columns with different levels, appending rows with new symbols, wrongly assuming `sort_values()` will sort symbols lexicographically, etc.
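Two of those pitfalls in a small sketch (symbol values invented for illustration):

```python
import pandas as pd

a = pd.Series(['AAPL', 'AMZN']).astype('category')
b = pd.Series(['AAPL', 'GOOG']).astype('category')

# Comparing two categorical Series whose categories differ raises a TypeError
# instead of comparing the underlying strings.
try:
    a == b
except TypeError as exc:
    print(exc)

# Concatenating categoricals with different categories silently falls back
# to object dtype, losing the compression.
combined = pd.concat([a, b], ignore_index=True)
print(combined.dtype)  # object
```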
Related question: without a Symbol datatype, is there a consensus on best practice for storing repetitive strings in a Series or DataFrame? Should we leave them as `object` dtype? Compress them as unordered categoricals? Do something else I haven't thought of?
Comment From: jreback
duplicate of https://github.com/pandas-dev/pandas/issues/8640
Simply use `category`, which works pretty well for this right now. Yes, it's not exactly the same thing, but practicality does beat purity (for the time being).
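A rough sketch of the memory difference (sizes here are illustrative, not benchmarks from this issue; 1 million rows rather than the 10 million in the example above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tickers = [f'SYM{i:03d}' for i in range(500)]          # 500 distinct symbols
col = pd.Series(rng.choice(tickers, size=1_000_000))   # 1 million "trades"

as_object = col.memory_usage(deep=True)
as_category = col.astype('category').memory_usage(deep=True)
print(as_object, as_category)  # category is typically an order of magnitude smaller
```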
Comment From: wesm
We've been discussing the idea of having dictionary-encoded strings (effectively what Symbol is) internally in pandas 2.0 -- see https://pandas-dev.github.io/pandas2/strings.html
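(Not part of the pandas 2.0 proposal itself, but as an illustration of the term: Arrow already exposes dictionary encoding directly, assuming `pyarrow` is installed.)

```python
import pyarrow as pa

arr = pa.array(['AAPL', 'AMZN', 'AAPL', 'GOOG'])
dict_arr = arr.dictionary_encode()
print(dict_arr.indices)     # integer codes: [0, 1, 0, 2]
print(dict_arr.dictionary)  # unique strings: ["AAPL", "AMZN", "GOOG"]
```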
Comment From: samkennerly
Oops, sorry about the duplicate and thanks for the links. I think a `pandas.string` type with built-in dictionary encoding sounds pretty cool.