| Abstract :
||The internship assignment consists of building and maintaining a web scraper. The goal is to collect Big Data sets from social media websites, which the Institute then uses for dialect analyses.
There are many web scraping tools available; they often have different features, and some can be quite costly. First, a web scraping technology has to be selected according to the needs of the assignment. In this case the tool has to be able to scrape data, filter it and subsequently store it in a database. After a comparison of several web scraping tools, the most suitable one is selected and implemented.
The focus of the research assignment is on Personally Identifiable Information. This type of information can be found almost everywhere on the World Wide Web, especially on social media. Most people do not understand the possible danger of their personal information falling into the wrong hands. A literature study explains the definition of Personally Identifiable Information, shows how it differs from Personal Data as defined by the General Data Protection Regulation, and demonstrates how criminals can (ab)use Personally Identifiable Information.
Furthermore, a basic principle is outlined for training a model that could be used to recognise Personally Identifiable Information (PII) in the data sets collected by the web scraper.
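
As a rough illustration of what recognising PII in scraped text can involve, the sketch below flags e-mail addresses and phone numbers with regular expressions and attaches a weak label to each post. It is not the model described in this work: the patterns, the names `PII_PATTERNS`, `find_pii` and `label_posts`, and the idea of using rule-based labels as a starting point for training data are all assumptions made for the example.

```python
import re

# Hypothetical patterns for illustration only; real PII detection needs
# much broader coverage (names, addresses, ID numbers, and so on).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def find_pii(text):
    """Return a list of (label, matched_text) pairs found in text."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


def label_posts(posts):
    """Attach a weak 'contains_pii' label to each post.

    Assumption: such rule-based labels could serve as a first,
    noisy training signal for a model; they are not ground truth.
    """
    return [{"text": p, "contains_pii": bool(find_pii(p))} for p in posts]
```

A call such as `label_posts(["Mail me at jan@example.com", "nice weather"])` would mark the first post as containing PII and the second as clean. A rule-based baseline like this misses most context-dependent PII, which is the gap a trained model would be expected to close.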