Should pandas have a Symbol datatype for strings which are frequently re-used?
For example, consider a table of stock trades with columns ['Timestamp','Price','Size','StockSymbol']. Suppose there are 10 million trades, but only 500 different stock symbols. Instead of storing 10 million strings, we could fill the 'StockSymbol' column with integers 0-499 where 0 represents AAPL, 1 represents AMZN, etc.
The Symbol type in Ruby, Scala, and Q does this compression/expansion automatically. It's a big memory saver in big tables, and Symbol comparison can be much faster than string comparison.
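For concreteness, here is a minimal sketch (ticker values invented) of that integer mapping using `pd.factorize`, which returns the integer codes plus the lookup table of unique strings -- essentially the Symbol encoding described above:

```python
import pandas as pd

symbols = pd.Series(['AAPL', 'AMZN', 'AAPL', 'GOOG', 'AMZN', 'AAPL'])

# factorize() returns the integer codes and the table of unique strings
codes, uniques = pd.factorize(symbols)
print(codes)    # [0 1 0 2 1 0]
print(uniques)  # Index(['AAPL', 'AMZN', 'GOOG'], dtype='object')
```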
I often use `.astype('category')` for string compression in pandas. But at a talk I saw recently, @wesm suggested this might not be a great idea. I've definitely caused several confusing problems by abusing categoricals this way: comparing two columns with different levels, appending rows with new symbols, wrongly assuming `sort_values()` will sort symbols lexicographically, etc.
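Two of those pitfalls in a small sketch (symbol values invented for illustration):

```python
import pandas as pd

a = pd.Series(['AAPL', 'AMZN']).astype('category')
b = pd.Series(['AAPL', 'GOOG']).astype('category')

# Comparing two categorical Series whose categories differ raises a TypeError
# instead of comparing the underlying strings.
try:
    a == b
except TypeError as exc:
    print(exc)

# Concatenating categoricals with different categories silently falls back
# to object dtype, losing the compression.
combined = pd.concat([a, b], ignore_index=True)
print(combined.dtype)  # object
```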
Related question: without a Symbol datatype, is there a consensus on best practice for storing repetitive strings in a Series or DataFrame? Should we leave them as `object` dtype? Compress them as unordered categoricals? Do something else I haven't thought of?
Comment From: jreback
duplicate of https://github.com/pandas-dev/pandas/issues/8640
Simply use `category`, which works pretty well for this right now. Yes, it's not exactly the same thing, but practicality does beat purity (for the time being).
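A rough sketch of the memory difference (sizes here are illustrative, not benchmarks from this issue; 1 million rows rather than the 10 million in the example above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tickers = [f'SYM{i:03d}' for i in range(500)]          # 500 distinct symbols
col = pd.Series(rng.choice(tickers, size=1_000_000))   # 1 million "trades"

as_object = col.memory_usage(deep=True)
as_category = col.astype('category').memory_usage(deep=True)
print(as_object, as_category)  # category is typically an order of magnitude smaller
```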
Comment From: wesm
We've been discussing the idea of having dictionary-encoded strings (effectively what Symbol is) internally in pandas 2.0 -- see https://pandas-dev.github.io/pandas2/strings.html
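(Not part of the pandas 2.0 proposal itself, but as an illustration of the term: Arrow already exposes dictionary encoding directly, assuming `pyarrow` is installed.)

```python
import pyarrow as pa

arr = pa.array(['AAPL', 'AMZN', 'AAPL', 'GOOG'])
dict_arr = arr.dictionary_encode()
print(dict_arr.indices)     # integer codes: [0, 1, 0, 2]
print(dict_arr.dictionary)  # unique strings: ["AAPL", "AMZN", "GOOG"]
```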
Comment From: samkennerly
Oops, sorry about the duplicate and thanks for the links. I think a `pandas.string` type with built-in dictionary encoding sounds pretty cool.