README.md

Environment
- Venv
- Poetry
Collecting data
- Setting up GMail
- Setting up S3 (optional)
- AWS Lambda code
Visualizing / exploring data
- Any senders that show up multiple times?
- Any known phishing links?
  - use Google Safe Browsing API
Models - spam vs. not spam
- a) Bag-of-words + Multinomial Naive Bayes
- b) LSTM
...

0. Environment

Venv

python -m venv ./venv

Poetry

poetry export -f requirements.txt --output requirements.txt

Collecting data

Generate an app password for your Gmail account. I called mine "AWS Lambda", but you can call it whatever you want. To do this, go to your Gmail account (click on your profile picture) > "Manage your Google Account" > Security tab > Signing in to Google > App Passwords, then create a password for Mail.
Set up IMAP. In Gmail itself, click the gear icon to open Settings, then go to the "Forwarding and POP/IMAP" tab. Scroll down to IMAP Access and make sure IMAP is enabled.
Create the DynamoDB table (use the provided script).
Set up the Lambda layer (reference). Running generate_lambda_deployment_package.sh does all this for you. Then add the layer to the Lambda function: in the console, scroll down below the Cloud9 editor to Layers and attach the layer.

Make sure to increase the Lambda function's timeout to ~15 minutes (the maximum Lambda allows).
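The collection steps above can be sketched with the standard library alone. This is a minimal sketch, not the code in this repository: the function names `parse_message` and `fetch_spam` are hypothetical, the folder name `[Gmail]/Spam` assumes an English-locale Gmail account, and the DynamoDB write is left out. It assumes IMAP is enabled and that you log in with the app password generated earlier.

```python
import email
import imaplib
from email import policy


def parse_message(raw_bytes):
    """Turn raw RFC 822 bytes into a small dict of fields worth storing."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer the plain-text part; fall back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": str(msg.get("From", "")),
        "subject": str(msg.get("Subject", "")),
        "date": str(msg.get("Date", "")),
        "body": body.get_content() if body is not None else "",
    }


def fetch_spam(user, app_password, limit=10):
    """Fetch up to `limit` of the newest messages from the Gmail Spam folder."""
    conn = imaplib.IMAP4_SSL("imap.gmail.com")
    try:
        conn.login(user, app_password)
        # Assumption: the Spam folder is named "[Gmail]/Spam" for this account.
        conn.select('"[Gmail]/Spam"', readonly=True)
        _, data = conn.search(None, "ALL")
        ids = data[0].split()[-limit:]
        messages = []
        for msg_id in ids:
            _, parts = conn.fetch(msg_id, "(RFC822)")
            messages.append(parse_message(parts[0][1]))
        return messages
    finally:
        conn.logout()
```

Opening the folder with `readonly=True` keeps the fetch from marking messages as read. Inside a Lambda handler, the credentials would more plausibly come from environment variables than from literals in the code.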

Data exploration

Note: 677 emails are labeled as 'spam' out of X emails total (Gmail notes that "Messages that have been in Spam more than 30 days will be automatically deleted."). Because spam is a small share of this dataset, for this exercise I'll also try framing it as an anomaly detection problem. In practice, over X% of email is spam (SOURCE?).
For a quick comparison: in the last 30 days, I have 677 spam emails but X 'real' inbox emails, and of the real inbox emails, Y are from mailing lists / coupon lists / stores.
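One of the exploration questions in the outline is whether any senders show up multiple times. A rough sketch of that count is below; `repeated_senders` is a hypothetical helper, and it assumes the `From` headers have already been collected as plain strings.

```python
from collections import Counter
from email.utils import parseaddr


def repeated_senders(from_headers, min_count=2):
    """Return (address, count) pairs for senders seen at least `min_count` times."""
    # parseaddr splits "Name <addr>" into (name, addr); keep the address only,
    # lowercased so that case differences do not split one sender in two.
    counts = Counter(
        parseaddr(header)[1].lower() for header in from_headers if header
    )
    return [(addr, n) for addr, n in counts.most_common() if addr and n >= min_count]
```

Grouping on the address rather than the display name matters here, since spam often reuses one address under several display names (and the reverse).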

Models

Multinomial N.B - spam vs. nonspam
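To make the bag-of-words + Multinomial Naive Bayes idea concrete, here is a small from-scratch sketch using only the standard library. It is an illustration, not the model in this repository: `tokenize` and `TinyMultinomialNB` are hypothetical names, the tokenizer is deliberately crude, and Laplace smoothing with `alpha=1.0` is an assumed default. In practice, scikit-learn's `CountVectorizer` and `MultinomialNB` would be the more likely choice.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyMultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing strength

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together, which matters for long emails.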

LSTM - spam vs. nonspam

Other

TODO - decide on ways to explore the data (e.g. classifying spam by category, such as phishing vs. social engineering)

majamil16/gmail-spam
