Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

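import urllib.parse

import pandas as pd

url = "https://en.wikipedia.org/wiki/2013–14_Premier_League"
# Percent-encode the non-ASCII characters but keep the URL delimiters intact.
url_escaped = urllib.parse.quote_plus(url, "/:?=&")  # workaround: this line is needed
print(f"url:{url}")
print(f"escaped:{url_escaped}")
tables = pd.read_html(url_escaped, encoding="UTF-8")  # works
tables = pd.read_html(url, encoding="UTF-8")  # raises UnicodeEncodeError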

# url:https://en.wikipedia.org/wiki/2013–14_Premier_League
# escaped:https://en.wikipedia.org/wiki/2013%E2%80%9314_Premier_League

Issue Description

Some users hit a UnicodeEncodeError when calling read_html with a URL that contains multibyte (non-ASCII) characters. Internally, pandas uses urllib, which does not support such URLs, so the URL has to be escaped before it is passed in.

I think we can fix this by passing the escaped URL to urllib instead of the raw multibyte URL.
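The escaping step described above can be sketched as follows. This is only an illustration of the workaround, not how pandas handles URLs: `escape_url` is a hypothetical helper name, and the set of characters treated as safe is an assumption that suits this Wikipedia URL. The sketch also checks that the raw URL cannot be encoded as ASCII, which is presumably where the UnicodeEncodeError comes from.

```python
from urllib.parse import quote


# Hypothetical helper; not part of pandas or urllib.
def escape_url(url: str) -> str:
    """Percent-encode non-ASCII characters, leaving URL delimiters as-is."""
    # "%" is kept safe so an already-escaped URL is not encoded twice.
    return quote(url, safe="/:?=&#%")


url = "https://en.wikipedia.org/wiki/2013–14_Premier_League"
escaped = escape_url(url)

# The raw URL contains an en dash, so it cannot be encoded as ASCII.
try:
    url.encode("ascii")
    raw_is_ascii = True
except UnicodeEncodeError:
    raw_is_ascii = False

print(raw_is_ascii)  # False
print(escaped)  # https://en.wikipedia.org/wiki/2013%E2%80%9314_Premier_League
# tables = pd.read_html(escaped, encoding="UTF-8")  # requires network access
```

Using `quote` rather than `quote_plus` avoids turning spaces into `+`, which only matters for URLs that contain spaces; for the URL in this report both give the same result.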

Expected Behavior

read_html should return the expected tables even when it is given a URL that contains multibyte characters.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 3e89b4c4b1580aa890023fc550774e63d499da25
python           : 3.9.1.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.5.0
Version          : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:29 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T8101
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8
pandas           : 1.2.0
numpy            : 1.23.1
pytz             : 2020.5
dateutil         : 2.8.1
pip              : 22.0.3
setuptools       : 51.1.1
Cython           : None
pytest           : 7.1.2
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.6.2
html5lib         : 1.1
pymysql          : 1.0.2
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.22.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 6.0.1
pyxlsb           : None
s3fs             : None
scipy            : 1.7.3
sqlalchemy       : 1.4.37
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

Comment From: kouml

Related background issue: https://github.com/pandas-dev/pandas/issues/21499

Comment From: kouml

@mroeschke Hi, what do you think about this issue? It hits users in countries whose languages use non-ASCII characters especially hard, because their URLs often contain such characters (see the Wikipedia URL above). If you agree, I can fix it. Thanks in advance.

Comment From: kouml

Closed. This is the specified behavior rather than a bug. Details are here: https://github.com/pandas-dev/pandas/pull/50259