We recently added our workgroup email addresses to our website: https://pandas.pydata.org/about/team.html#workgroups

While this has been useful, and we have received relevant emails from people who otherwise wouldn't know how to contact us easily, we have also started receiving spam. I'm unsure whether the spam is sent manually by people who find the addresses on our website, or by bots fetching our email addresses automatically. But in case it's the latter, I think it'd be good to see if we can easily obfuscate the email addresses in the html code.

I guess there are many options, but something very easy that could possibly stop some of the spam would be to simply prepend a string to the email addresses in the html, and then remove it via javascript. This won't help with spammers getting our addresses manually, or using scrapers with javascript support like selenium, but with some luck most of the spam comes from simpler bots just fetching the html.

The idea would be that, for example, if the address is address@pandas.pydata.org, the html generated from the markdown is something like <a href="mailto:noaddress@pandas.pydata.org">noaddress@pandas.pydata.org</a>, and then we have a simple javascript block that removes the "no" prefix so that the final html rendered to the user contains the right address.
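A minimal sketch of that round trip, written in Python purely for illustration (on the site itself the stripping step would be JavaScript running in the browser). The prefix value `no` comes from the example above; the function names `obfuscate` and `deobfuscate` are hypothetical and not part of `pandas_web.py`.

```python
import re

PREFIX = "no"  # assumption: the marker prepended to every address in the static html


def obfuscate(address: str) -> str:
    """Return the anchor as it would appear in the static html (decoy address)."""
    decoy = PREFIX + address
    return f'<a href="mailto:{decoy}">{decoy}</a>'


def deobfuscate(html: str) -> str:
    """Mimic the client-side step: drop the prefix after 'mailto:' and in the link text."""
    pattern = re.compile(r"(mailto:|>)" + re.escape(PREFIX) + r'(?=[^<>"]+@)')
    return pattern.sub(r"\1", html)


static_html = obfuscate("address@pandas.pydata.org")
rendered_html = deobfuscate(static_html)
print(static_html)
print(rendered_html)
```

A bot that only downloads the html sees `static_html`, so it would collect the decoy address rather than the real one.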

This is the file where this should be implemented: https://github.com/pandas-dev/pandas/blob/main/web/pandas/about/team.md#-workgroupname-

Comment From: Kabiirk

Hi,

First time contributor to Pandas here. There are many ways to obfuscate emails on websites and prevent bots from scraping them, for example:

* address[at]pandas[dot]pydata[dot]org (this would be text on the website itself, but it means more work for the user, who has to copy the address and replace these characters; plus I think bots can just do the same replacement)
* Use special HTML characters to our advantage (this doesn't guarantee protection, but would reduce bot scraping a bit): <a href="mailto:address&commat;pandas&period;pydata&period;com"> user&commat;domain&period;com</a>
* Encode it completely, like: <a href="&#x6d;&#x61;&#x69;&#x6c;&#x74;&#x6f;&#x3a;&#x62;&#x65;&#x6e;&#x75;&#x74;&#x7a;&#x65;&#x72;&#x40;&#x64;&#x6f;&#x6d;&#x61;&#x69;&#x6e;&#x2e;&#x64;&#x65;">email</a>
* Or other methods (WIP, will see what I can find)
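To make the entity-based options concrete, here is a small sketch using only Python's standard `html` module. The helper name `to_hex_entities` is hypothetical. It also shows why these options offer no guarantee: any scraper that decodes entities recovers the address in one call.

```python
import html


def to_hex_entities(text: str) -> str:
    """Encode every character as a hexadecimal HTML character reference."""
    return "".join(f"&#x{ord(ch):x};" for ch in text)


# Fully encoded mailto target, as in the third option above.
encoded = to_hex_entities("mailto:address@pandas.pydata.org")
anchor = f'<a href="{encoded}">email</a>'

# What a browser (or an entity-aware scraper) gets back after decoding.
decoded = html.unescape(encoded)

# Named entities from the second option decode just as easily.
named = html.unescape("address&commat;pandas&period;pydata&period;org")

print(anchor)
print(decoded)
print(named)
```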

A few resources which I found were as follows:

* How to protect your website email address from spam
* Email Obfuscation

As per my understanding of the scope of work for this issue, I would need to edit the Markdown file mentioned by you to parse the text differently but display it as close to the email id as possible. If so, please let me know if I can work on this issue.

Thanks & Regards

Comment From: datapythonista

Thanks for the help @Kabiirk. What you say is correct, just keep in mind these goals:

- Prevent scrapers from getting our correct email addresses
- Allow visitors of the website to find and use our email addresses easily
- Keep things easy in our code/markdown so maintenance is straightforward

Some of the things you mention make a lot of sense, but seem to overcomplicate things, since they'd require writing code that does the encoding or transformation of the email address in our web generator script. That's why I thought that just prepending some text to the addresses was a better idea. In any case, it's great if you can work on this, and I'm open to ideas, just keep in mind those goals. Thanks!

Comment From: Kabiirk

Thanks, will keep these goals in mind.

To assign this issue to me, do I need to use a TAKE command in this thread?

Comment From: datapythonista

I assigned it to you. For next time, yes, you need to write just take (lowercase) in a comment (that's a hack we implemented since GitHub won't allow you to assign directly).

Comment From: Kabiirk

Thanks, Understood

Comment From: Kabiirk

Hi,

Facing some challenges

Challenge 1

While building the website from source (pandas/web) with the below command:

C:/path_to_local_pandas_fork/pandas/web> python pandas_web.py pandas

The static site is being generated, but looks like this: [screenshot]

while the official site looks like this: [screenshot]

Potential Reason

[screenshot]

Way Forward

I think this is a rate-limiting and CSS issue. Since my main work is with the emails, I don't think it should be a problem, so I'll carry on with my work. But since I am going to run this command frequently during testing, I hope that won't cause any issues?


Challenge 2

Also, while initially building the static site, I got the following error at 4 instances: UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

The lines of code which caused these errors were:

https://github.com/pandas-dev/pandas/blob/11d856f52689998bf8c5427e2f9168452a44f8e9/web/pandas_web.py#L113-L114
https://github.com/pandas-dev/pandas/blob/11d856f52689998bf8c5427e2f9168452a44f8e9/web/pandas_web.py#L343-L344
https://github.com/pandas-dev/pandas/blob/11d856f52689998bf8c5427e2f9168452a44f8e9/web/pandas_web.py#L415-L416
https://github.com/pandas-dev/pandas/blob/11d856f52689998bf8c5427e2f9168452a44f8e9/web/pandas_web.py#L425-L426

So I did some troubleshooting and found out that this error is caused because we aren't telling the open() call what codec to use when reading the file. Because of this, the file is opened with the system default codec, which is OS dependent.

Potential Reason

OS: Windows 10 Home Single Language

This may be because my OS's default character encoding is not utf-8.

Possible Fix [This has only been implemented in my local Branch] :

In all 4 instances, I modified pandas_web.py to specify the character encoding when opening the file, i.e. explicitly telling open() that we are reading a utf-8 encoded file. This solved the issue for me. For example, I did something like:

with open(filepath, encoding="utf-8") as f:
    content = f.read()  # the error occurred while reading, so the file is opened for reading
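The failure and the fix can be reproduced on any OS with the sketch below. It assumes the Windows default codec was cp1252 (the thread does not state which codec was in use), and the file name and sample text are made up for the demonstration.

```python
import tempfile
from pathlib import Path

# Hypothetical sample: U+00C1 is encoded in UTF-8 as bytes C3 81,
# and byte 0x81 has no mapping in cp1252.
sample = "\u00c1lvaro"

with tempfile.TemporaryDirectory() as tmp:
    filepath = Path(tmp) / "team.md"
    filepath.write_bytes(sample.encode("utf-8"))

    # Simulate the Windows default by forcing cp1252 (assumption).
    try:
        with open(filepath, encoding="cp1252") as f:
            f.read()
        error = None
    except UnicodeDecodeError as exc:
        error = exc

    # The fix: state the encoding explicitly instead of relying on the OS default.
    with open(filepath, encoding="utf-8") as f:
        content = f.read()

print(error)
print(content == sample)
```

Passing `encoding` explicitly makes the build behave the same regardless of the platform's preferred encoding.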

Way Forward

After I am done with the current issue I am working on, should I open a separate issue for this?

Comment From: Kabiirk

Hi,

I have implemented a JavaScript-based solution for protecting workgroup emails: [screenshot]

It looks the same as the current workgroup email on the website. mailto: also works when I hover over and click on the email: [screenshot]

To test this, I wrote a web scraper in BeautifulSoup, which was not able to detect workgroup email IDs (both after mailto: and between <a></a> tags) when the webpage used my JavaScript implementation to write emails. The result is as follows: [screenshot]
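A rough idea of what such a check could look like, using a plain regular expression from the standard library rather than BeautifulSoup. The markup in `script_html` is hypothetical (the actual JavaScript implementation is not shown in this thread); it only assumes that the address is assembled in the browser and never appears whole in the static html.

```python
import re

# Simple pattern of the kind a basic harvesting bot might use (assumption).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrape_emails(html: str) -> list[str]:
    """Return the distinct email-like strings found in raw html."""
    return sorted(set(EMAIL_RE.findall(html)))


# Address written directly into the page.
plain_html = '<a href="mailto:address@pandas.pydata.org">address@pandas.pydata.org</a>'

# Hypothetical markup where a script would build the address client-side.
script_html = '<a class="email" data-user="address" data-domain="pandas.pydata.org"></a>'

found_plain = scrape_emails(plain_html)
found_script = scrape_emails(script_html)
print(found_plain)
print(found_script)
```

A scraper that executes JavaScript, or one written specifically for this markup, would still recover the address.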

Note: This will not stop all scraping bots, but should be able to stop a lot of them, while at the same time being simple to maintain and keeping the addresses easily accessible to the end user.

Please let me know if I can go ahead and make the PR.

Regards.

Comment From: datapythonista

Thanks for the work on this @Kabiirk, sounds great. Sure, go ahead and open the PR (for next time, feel free to open a PR anytime, even if you're unsure of the approach...). You can tag me on it.

Comment From: Kabiirk

@datapythonista Thanks for the help 😄 ! I'll keep that in mind. Opening the PR in a while.

Please do let me know whether I should open an issue for the UnicodeDecodeError I was facing while building the website with pandas_web.py, or fix it in this PR itself.