We recently added our workgroup email addresses to our website: https://pandas.pydata.org/about/team.html#workgroups
While this has been useful, and we have received relevant emails from people who otherwise wouldn't know how to contact us easily, we have also started receiving spam. I'm unsure whether the spam is sent manually by people who end up on our website, or by bots fetching our email addresses automatically. But in case it's the latter, I think it'd be good to see if we can easily obfuscate the email addresses in the HTML code.
I guess there are many options, but something very easy that could possibly stop some of the spam would be to simply prepend a string to the email addresses in the HTML, and then remove it via JavaScript. This won't help with spammers getting our addresses manually, or using scrapers with JavaScript support like Selenium, but with some luck most of the spam comes from simpler bots just fetching the HTML.
The idea would be that, for example, if the address is `address@pandas.pydata.org`, the HTML generated from the markdown is something like `<a href="mailto:noaddress@pandas.pydata.org">noaddress@pandas.pydata.org</a>`, and then we have a simple JavaScript block that removes the `no` and makes the final HTML rendered to the user contain the right address.
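As a rough model of this idea (written in Python only to illustrate the logic; on the page the stripping step would be done in JavaScript), the sketch below shows what a simple bot that reads the raw HTML would collect, and how removing the prefix recovers the real address. The regular expression and the names `harvest` and `strip_prefix` are assumptions made for this illustration, not part of the website code.

```python
import re

PREFIX = "no"  # decoy prefix from the example above

# Markup as it would be served, before any JavaScript runs.
served_html = (
    '<a href="mailto:noaddress@pandas.pydata.org">'
    "noaddress@pandas.pydata.org</a>"
)

# Deliberately simple pattern, similar in spirit to what a basic harvester might use.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def harvest(html_text):
    """Return the addresses a bot would find by scanning the raw HTML only."""
    return set(EMAIL_RE.findall(html_text))


def strip_prefix(address, prefix=PREFIX):
    """Remove the decoy prefix, but only when it leads the address."""
    if address.startswith(prefix):
        return address[len(prefix):]
    return address


harvested = harvest(served_html)
print(harvested)                             # what the simple bot collects
print({strip_prefix(a) for a in harvested})  # what the visitor should see
```

One detail worth keeping in mind: stripping only a leading prefix is safer than replacing every `no` in the string, since an address could contain those letters elsewhere.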
This is the file where this should be implemented: https://github.com/pandas-dev/pandas/blob/main/web/pandas/about/team.md#-workgroupname-
Comment From: Kabiirk
Hi,
First-time contributor to pandas here. There are many ways to obfuscate emails on websites and prevent bots from scraping them, for example:
* `address[at]pandas[dot]pydata[dot]org` (this would be the text on the website itself, but it means more work for the user, who has to copy it and replace these characters; plus, I think bots can just do the same replacement)
* Use special HTML characters to our advantage (this doesn't guarantee protection, but would reduce bot scraping a bit): `<a href="mailto:address@pandas.pydata.com"> user@domain.com</a>`
* Encode it completely, like: `<a href="mailto:ben;utzer@domain.de">email</a>`
* Other methods (WIP, I'll see what I can find)
A few resources I found:
* How to protect your website email address from spam
* Email Obfuscation
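To make the "encode it completely" option concrete, here is a minimal sketch of what the encoding step could look like if it were done in Python. The function name `encode_entities` is hypothetical, and this is not code from `pandas_web.py`. It replaces every character of the address with its decimal HTML entity, which a browser decodes transparently.

```python
import html


def encode_entities(text):
    """Replace every character with its decimal HTML entity, e.g. '@' -> '&#64;'."""
    return "".join(f"&#{ord(ch)};" for ch in text)


address = "address@pandas.pydata.org"
encoded = encode_entities(address)

# The anchor a generator script would emit; no literal '@' appears in it.
link = f'<a href="mailto:{encoded}">{encoded}</a>'
print(link)

# A browser decodes the entities when rendering. html.unescape does the same,
# which is also why a slightly smarter bot can undo this obfuscation.
decoded = html.unescape(encoded)
print(decoded)
```

The trade-off is that the markdown source can no longer contain the plain address, so something has to perform this transformation at build time.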
As I understand the scope of this issue, I would need to edit the Markdown file you mentioned so that the address is written differently in the source but displayed as close to the real email address as possible. If so, please let me know if I can work on this issue.
Thanks & Regards
Comment From: datapythonista
Thanks for the help @Kabiirk. What you say is correct, just keep in mind these goals:
- Prevent scrapers from getting our correct email addresses
- Allow visitors of the website to find and use our email addresses easily
- Keep things easy in our code/markdown so maintenance is straightforward
Some of the things you mention make a lot of sense, but seem to overcomplicate things, since they'd require writing code that does the encoding or transformation of the email address in our web generator script. That's why I thought that just prepending some text to the addresses was a better idea. In any case, it's great if you can work on this, and I'm open to ideas, just keep in mind those goals. Thanks!
Comment From: Kabiirk
Thanks, I will keep these goals in mind.
To assign this issue to me, do I need to use a `TAKE` command in this thread?
Comment From: datapythonista
I assigned it to you. For next time, yes, you just need to write `take` (lowercase) in a comment (that's a hack we implemented since GitHub won't allow you to assign directly).
Comment From: Kabiirk
Thanks, Understood
Comment From: Kabiirk
Hi,
Facing some challenges
Challenge 1
While building the website from source (`pandas/web`) with the command below:
C:/path_to_local_pandas_fork/pandas/web> python pandas_web.py pandas
The static site is generated, but looks like this
while the official site looks like this:
Potential Reason
Way Forward
I think this is a rate-limiting and CSS issue. Since my main work is with the emails, I don't think this should be a problem, so I'll carry on with my work. But since I am going to run this command frequently during testing, I hope there will be no issues if I do that?
Challenge 2
Also, while initially building the static site, I got the following error in 4 places:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
The lines of code that caused these errors were:
https://github.com/pandas-dev/pandas/blob/11d856f52689998bf8c5427e2f9168452a44f8e9/web/pandas_web.py#L113-L114
https://github.com/pandas-dev/pandas/blob/11d856f52689998bf8c5427e2f9168452a44f8e9/web/pandas_web.py#L343-L344
https://github.com/pandas-dev/pandas/blob/11d856f52689998bf8c5427e2f9168452a44f8e9/web/pandas_web.py#L415-L416
https://github.com/pandas-dev/pandas/blob/11d856f52689998bf8c5427e2f9168452a44f8e9/web/pandas_web.py#L425-L426
So I did some troubleshooting and found that this error occurs because we aren't telling the `open()` call what codec to use when reading the file. Because of this, the file is opened with the system default codec, which is OS dependent.
Potential Reason
OS : Windows 10 Home Single Language
This may be because my OS's default character encoding is not `utf-8`.
Possible Fix [This has only been implemented in my local branch]:
In all 4 instances, I modified `pandas_web.py` to specify the character encoding when opening the file, which solved the issue for me, i.e. explicitly telling `open` that we are reading a `utf-8` encoded file. For example, I did something like:
    # The default mode is "r", so this reads the file as UTF-8
    with open(filepath, encoding="utf-8") as f:
        content = f.read()
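To illustrate why the explicit encoding matters, here is a small self-contained sketch that reproduces the same kind of error on any OS. It assumes the Windows default codec is `cp1252`, which is a common default but an assumption here, since the thread only says the default may not be `utf-8`. The file name `sample.md` and its content are made up for the example.

```python
import os
import tempfile

# "Á" is stored in UTF-8 as the bytes 0xC3 0x81, and 0x81 has no mapping in cp1252.
text = "Á"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.md")  # hypothetical file for the demo
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Simulate reading with a non-UTF-8 system default (assumed: cp1252).
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError as exc:
        cp1252_failed = True
        print(exc)

    # Reading with the encoding the file was actually written in works everywhere.
    with open(path, encoding="utf-8") as f:
        roundtrip = f.read()

print(cp1252_failed, roundtrip == text)
```

Passing `encoding` explicitly makes the build behave the same regardless of the machine's locale settings.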
Way Forward
After I am done with the issue I am currently working on, should I open a separate issue for this?
Comment From: Kabiirk
Hi,
I have implemented a JavaScript-based solution for protecting `workgroup` emails:
Which looks the same as the current `workgroup` email on the website. `mailto:` also works when I hover over and click on the email:
To test this, I wrote a web scraper with `BeautifulSoup`, which was not able to detect the workgroup email IDs (both after `mailto:` and between the `<a></a>` tags) when the webpage used my JavaScript implementation to write the emails. The result is as follows:
Note: This will not stop all scraping bots, but should be able to stop a lot of them, while at the same time being simple to maintain and keeping the emails easily accessible to the end user.
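For anyone who wants to repeat this kind of check without extra dependencies, below is a minimal sketch of a scraper test using only the standard library's `html.parser` as a stand-in for `BeautifulSoup`. The page markup, the `wg-email` id, and the embedded script are hypothetical and only meant to resemble a page where JavaScript writes the address; they are not the markup from my implementation.

```python
import re
from html.parser import HTMLParser

# Hypothetical served HTML: the anchor is empty and a script fills it in the browser.
PAGE = """
<p>Contact: <a id="wg-email"></a></p>
<script>
  var user = "address", host = "pandas.pydata.org";
  var link = document.getElementById("wg-email");
  link.href = "mailto:" + user + "@" + host;
  link.textContent = user + "@" + host;
</script>
"""

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class EmailHarvester(HTMLParser):
    """Collect addresses from mailto: hrefs and visible text, without running scripts."""

    def __init__(self):
        super().__init__()
        self.found = set()
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        for name, value in attrs:
            if tag == "a" and name == "href" and value and value.startswith("mailto:"):
                self.found.add(value[len("mailto:"):])

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies are not visible text, so they are skipped here.
        if not self._in_script:
            self.found.update(EMAIL_RE.findall(data))


def harvest(html_text):
    parser = EmailHarvester()
    parser.feed(html_text)
    parser.close()
    return parser.found


plain = '<a href="mailto:address@pandas.pydata.org">address@pandas.pydata.org</a>'
print(harvest(plain))  # the unprotected anchor is found
print(harvest(PAGE))   # nothing is found without executing the script
```

A bot that runs a real browser engine would still see the final address, which is the limitation mentioned in the note above.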
Please let me know if I can go ahead and make the PR.
Regards.
Comment From: datapythonista
Thanks for the work on this @Kabiirk, sounds great. Sure, go ahead and open the PR (for next time, feel free to open a PR anytime, even if you're unsure of the approach...). You can tag me on it.
Comment From: Kabiirk
@datapythonista Thanks for the help 😄 ! I'll keep that in mind. Opening the PR in a while.
Please do let me know if I should open an issue for the `UnicodeDecodeError` I was facing while building the website with `pandas_web.py`, or fix it in this PR itself?