Data Sources – AI for Cybersecurity in Banking

DFIR Report: Digital Forensics and Incident Response Report

A company dedicated to documenting real intrusions by providing reports, analysis, and services to security experts.
Data is generated by members and monitored and compiled by a team whose members have held CTO positions and/or focus on information security.
The site contains many reports, one of which is devoted exclusively to ransomware.
Ransomware encrypts company files, bringing bank services to a halt, as demonstrated in our first diamond model. The DFIR Report provides appropriate data to mine for our research. This dataset contains a collection of malware files used for ransomware, which can be combined with email data to identify whether an email is malicious.

Intrusion Detection Evaluation Dataset (CIC-IDS2017)

Accessed through Kaggle and the Canadian Institute for Cybersecurity
This dataset provides up-to-date attacks with realistic background traffic for network attack analysis. It was built using 25 profiles of typical human behavior on a network, based on the HTTP, HTTPS, FTP, SSH, and email protocols.
The dataset helps financial institutions simulate network intrusions and test AI models, with the goal of reducing false alarms and increasing the share of fraudulent cases detected.
The data was generated by researchers for use in machine learning and deep learning cybersecurity models.
The dataset being used: “Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX”
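
The two goals named above, fewer false alarms and more attacks detected, correspond to the false alarm rate and the detection rate of a binary classifier. The following is a minimal sketch of how they could be computed. The label encoding (1 for attack, 0 for benign) and the sample labels are hypothetical and are not taken from CIC-IDS2017.

```python
def detection_metrics(y_true, y_pred):
    """Return (detection_rate, false_alarm_rate) for binary labels.

    Assumed encoding: 1 marks an attack flow, 0 marks benign traffic.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_alarm_rate


# Hypothetical labels: 4 attack flows followed by 6 benign flows.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
dr, far = detection_metrics(y_true, y_pred)
print(f"detection rate={dr:.2f}, false alarm rate={far:.2f}")
```

A model trained on this dataset would ideally push the detection rate up while holding the false alarm rate down, since each false alarm costs analyst time.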

Phishing Email Detection Dataset

This Kaggle dataset is meant as a training resource for machine learning tools that determine which emails are potential phishing emails based on the content of the email body.
The dataset is three months old, so it is recent and still relevant.
The CSV file contains over 28,000 rows of email body text, each classified as either a phishing email or a safe email. The dataset is important and relevant because phishing appears in our diamond models and can serve as an entryway for attackers to launch other attacks.
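
As a first step before training, it may help to check how the two classes are balanced. The sketch below counts labels in a CSV of this shape. The column names "Email Text" and "Email Type", the label strings, and the three sample rows are assumptions made for illustration and should be checked against the actual file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; column names and label values are assumed.
SAMPLE_CSV = """Email Text,Email Type
"Your account is locked, verify your password here",Phishing Email
"Minutes from the Tuesday branch meeting are attached",Safe Email
"You won a prize, click to claim it now",Phishing Email
"""


def label_counts(csv_file, label_column="Email Type"):
    """Count how many rows carry each class label."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


counts = label_counts(io.StringIO(SAMPLE_CSV))
for label, n in counts.items():
    print(label, n)
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle, for example `open(path, newline="", encoding="utf-8")`.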

CTU Mixed Capture 5 Dataset

Stratosphere Laboratory is an organization that originated at the Czech Technical University in Prague as a continuation of the work of a PhD student. Stratosphere IPS offers a free, machine learning based intrusion prevention system, along with other ongoing projects and publicly available datasets.
This dataset is 173 MB and contains packet capture data from a simulated malware attack.
The data is from 2015. It is valuable due to its complexity and comprehensiveness, and it should provide a useful snapshot of a realistic malware use case.

JavaScript Vulnerability Dataset

Accessed through GitHub
The JavaScript Vulnerability dataset contains vulnerability information from the public databases of the Node Security Project and the Snyk platform (12,126 rows).
JavaScript is gaining popularity as a programming language for server-side web applications, mobile apps, and IoT implementations.
The wide-scale adoption by developers of third-party packages, such as those hosted by the Node Package Manager (npm), increases exposure to JavaScript vulnerabilities.
The dataset can be used to build prediction models that determine, from static source code metrics, whether a JavaScript function contains a vulnerability.
This can benefit our work since JavaScript-based IoT devices were included among our diamond models.

Malicious URLs Dataset

This Kaggle dataset is meant as a training resource for developing machine learning models that help determine which URLs have malicious content.
It is a large CSV file containing 651,191 URLs, of which 428,103 are considered safe.
Malicious URLs can host unsolicited threats that can lead to malware installation, phishing and theft of private banking information.
This dataset is important and relevant because malware installation, phishing, and theft of private banking information all appear in our diamond models and can provide fertile ground for adversarial attacks.
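
Models trained on URL data commonly start from simple lexical features of the URL string. The following is a minimal sketch of such a feature extractor. The feature set and the function name `url_features` are illustrative assumptions, not features defined by the dataset.

```python
import re
from urllib.parse import urlsplit

# Matches a dotted-quad host such as 192.168.10.5 (no range check on octets).
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_features(url):
    """Extract a few lexical features from a raw URL string."""
    raw = url
    # urlsplit only fills in the host when a scheme or "//" is present,
    # so a scheme is prepended for parsing when the raw URL lacks one.
    if "://" not in url:
        url = "http://" + url
    host = urlsplit(url).hostname or ""
    return {
        "length": len(raw),
        "num_digits": sum(ch.isdigit() for ch in raw),
        "num_dots_in_host": host.count("."),
        "host_is_ipv4": bool(IPV4_RE.match(host)),
        "has_at_symbol": "@" in raw,
    }


# Hypothetical URL, chosen only to exercise the features.
features = url_features("http://192.168.10.5/login/verify.php?id=123")
print(features)
```

Features like a raw IP address in place of a hostname are weak signals on their own, so a sketch like this would feed a classifier rather than act as a rule by itself.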