Pandas ENH: Add pd.read_soql() for Salesforce data retrieval

Feature Type

[x] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas

Problem Description

I wish I could use Pandas to query Salesforce using SOQL (Salesforce Object Query Language) just as you do with relational databases using pd.read_sql(). Currently, to retrieve Salesforce data into a Pandas DataFrame, users must execute queries using simple_salesforce and manually convert results to DataFrames, which in case of multi-level queries can become quite inefficient (due to the nested format of the outputs of Salesforce REST API).

Feature Description

The function would parallel pd.read_sql() and would look like this:

def read_soql(
    query: str,  # Equivalent to `query` in `pd.read_sql()`, representing the SOQL query string.
    con,  # Expects a `simple_salesforce.Salesforce` object instead of an SQLAlchemy connection.
    index_col: str | list[str] | None = None,  # Same as in `pd.read_sql()`
    parse_dates=None,  # Same as in `pd.read_sql()`
    dtype: DtypeArg | None = None,  # Same as in `pd.read_sql()`
    dtype_backend: DtypeBackend | lib.NoDefault = lib.no_default  # Same as in `pd.read_sql()`
) -> DataFrame:

    # validate connection is Salesforce object
    # validate dtype_backend is valid option

    # execute SOQL query and get all records
    # flatten output and remove metadata
    # convert records to DataFrame

    # if dtype specified: convert columns to specified types
    # if parse_dates specified: convert date columns
    # if index_col specified: set DataFrame index

    # if dtype_backend != 'numpy': convert to nullable types
    # return DataFrame

Alternative Solutions

Alternative naming Since pd.read_sql() is a convenience wrapper around read_sql_table and read_sql_query, a more "formally" correct name might be read_soql_query, as there is no corresponding read_soql_table. This would maintain a closer parallel to Pandas' SQL functions.
However, I propose read_soql for brevity, and for consistency with other I/O functions such as pd.read_excel(), pd.read_parquet(), pd.read_feather(), pd.read_orc() etc.

Additional Context

I am interested in developing this feature as I have already done some work towards its implementation.

Comment From: benvigano

take

Comment From: rhshadrach

Thanks for the request. I do not think this is a standard enough data format for inclusion in pandas directly. However there are other options, see Extending pandas. In particular, a third party package that registers a custom accessor. If such a package is made, we'd be happy to add it to our Ecosystem page.

cc @pandas-dev/pandas-core for other thoughts.

Comment From: datapythonista

If it's just a reader, I don't think an accessor is needed. This could be implemented as:

import pandas_soql

df = pandas_soql.read_soql(...)

I agree this shouldn't live in pandas. If we don't want this as an independent module from pandas (other than the ecosystem entry), I think the way forward would be to revisit https://pandas.pydata.org/pdeps/0009-io-extensions.html

Comment From: benvigano

Understood. Just out of personal curiosity, @rhshadrach when you say it's not standard enough, is it about the API/simple-salesforce not having enough guarantees of consistency, or just that not many people use soql?

Comment From: rhshadrach

@benvigano - the latter.