Feature Type
-
[x] Adding new functionality to pandas
-
[ ] Changing existing functionality in pandas
-
[ ] Removing existing functionality in pandas
Problem Description
I wish I could use Pandas to query Salesforce using SOQL (Salesforce Object Query Language) just as you do with relational databases using pd.read_sql()
. Currently, to retrieve Salesforce data into a Pandas DataFrame, users must execute queries using simple_salesforce and manually convert results to DataFrames, which in case of multi-level queries can become quite inefficient (due to the nested format of the outputs of Salesforce REST API).
Feature Description
The function would parallel pd.read_sql()
and would look like this:
def read_soql(
query: str, # Equivalent to `query` in `pd.read_sql()`, representing the SOQL query string.
con, # Expects a `simple_salesforce.Salesforce` object instead of an SQLAlchemy connection.
index_col: str | list[str] | None = None, # Same as in `pd.read_sql()`
parse_dates=None, # Same as in `pd.read_sql()`
dtype: DtypeArg | None = None, # Same as in `pd.read_sql()`
dtype_backend: DtypeBackend | lib.NoDefault = lib.no_default # Same as in `pd.read_sql()`
) -> DataFrame:
# validate connection is Salesforce object
# validate dtype_backend is valid option
# execute SOQL query and get all records
# flatten output and remove metadata
# convert records to DataFrame
# if dtype specified: convert columns to specified types
# if parse_dates specified: convert date columns
# if index_col specified: set DataFrame index
# if dtype_backend != 'numpy': convert to nullable types
# return DataFrame
Alternative Solutions
Alternative naming
Since pd.read_sql()
is a convenience wrapper around read_sql_table
and read_sql_query
, a more "formally" correct name might be read_soql_query
, as there is no corresponding read_soql_table
. This would maintain a closer parallel to Pandas' SQL functions.
However, I propose read_soql
for brevity, and for consistency with other I/O functions such as pd.read_excel()
, pd.read_parquet()
, pd.read_feather()
, pd.read_orc()
etc.
Additional Context
I am interested in developing this feature as I have already done some work towards its implementation.
Comment From: benvigano
take
Comment From: rhshadrach
Thanks for the request. I do not think this is a standard enough data format for inclusion in pandas directly. However there are other options, see Extending pandas. In particular, a third party package that registers a custom accessor. If such a package is made, we'd be happy to add it to our Ecosystem page.
cc @pandas-dev/pandas-core for other thoughts.
Comment From: datapythonista
If it's just a reader, I don't think an accessor is needed. This could be implemented as:
import pandas_soql
df = pandas_soql.read_soql(...)
I agree this shouldn't live in pandas. If we don't want this as an independent module from pandas (other than the ecosystem entry), I think the way forward would be to revisit https://pandas.pydata.org/pdeps/0009-io-extensions.html
Comment From: benvigano
Understood. Just out of personal curiosity, @rhshadrach when you say it's not standard enough, is it about the API/simple-salesforce not having enough guarantees of consistency, or just that not many people use soql?
Comment From: rhshadrach
@benvigano - the latter.