Web scraping and data protection

Web scraping, the large-scale and systematic collection of data accessible online by means of automated software, has recently raised significant concerns about the protection of personal data. Although it is widely used to feed artificial intelligence models and for business intelligence purposes, the practice poses concrete risks to the confidentiality of personal data, particularly special categories of data under Article 9 of Regulation (EU) 2016/679 (“GDPR”).

Web Scraping: Under the Spotlight of Privacy Authorities

Sixteen Data Protection Authorities from various parts of the world, in collaboration with some of the largest global social media companies, have worked together to define guidance on data scraping and privacy protection. This joint effort led to the publication of a Joint Declaration, which continues the initiative launched in the summer of 2023. In addition to outlining general principles, the document provides operational guidance and practical tools tailored to the needs of small and medium-sized enterprises (SMEs), supporting them in implementing appropriate prevention and mitigation measures against mass data extraction.

Starting with the definition of web scraping, this article examines the positions taken by the Italian Data Protection Authority and by international authorities, providing practical guidance both for those who use scraping techniques and for website operators looking to protect their content from unauthorized extraction.

Guidelines from the Italian Privacy Authority

The Italian Data Protection Authority recently published an informative note with guidance on how public and private entities can protect the personal data they publish online from web scraping, that is, the systematic extraction of data from websites using bots. Although not binding, the Italian Authority’s guidelines are a valuable tool for Data Controllers aiming to protect their information, as well as for operators who use web scraping as a business model or as a tool to implement their business strategies.

In the guidelines, the Authority defines web scraping as the massive and indiscriminate collection, storage, and retention of data, including personal data. As the Authority clarifies, the technique is not inherently illegal, provided that the scraped data is freely accessible on websites and is used for statistical or content-monitoring purposes. Given the rise of unauthorized web scraping, the Authority suggests several measures to prevent or at least mitigate it, specifically:

  • Creation of restricted areas where data is accessible only after registration, thus limiting the indiscriminate availability of data and reducing scraping activities;
  • Insertion of ad hoc clauses in service terms to legally prohibit scraping as a preventive measure;
  • Monitoring of network traffic, particularly HTTP requests, to identify unusual data flows, combined with techniques such as rate limiting;
  • Limiting bots through CAPTCHA verification, periodic changes to the HTML markup, and embedding content within multimedia objects.
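
To illustrate the rate-limiting measure mentioned in the list above, the following Python sketch caps how many requests a single client may make within a sliding time window. It is a minimal example under stated assumptions, not a measure prescribed by the Authority: the class name `SlidingWindowRateLimiter`, the thresholds, and the choice of the client IP address as identifier are all hypothetical, and a real site would typically enforce such limits at the web server or gateway level.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per client within `window_seconds`.

    Hypothetical helper for illustration only; the thresholds are assumptions.
    """

    def __init__(self, max_requests=60, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id):
        """Record a request and return True if it stays within the limit."""
        now = self.clock()
        hits = self._hits[client_id]
        # Discard timestamps that have fallen out of the current window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the caller should reject the request
        hits.append(now)
        return True
```

A request handler could call `allow()` with the client identifier and answer with HTTP status 429 (Too Many Requests) whenever it returns `False`.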

The Authority’s guidelines represent a significant step in protecting personal data in the context of web scraping and confirm the importance of addressing this phenomenon with specific attention to applicable privacy regulations.

Guidelines from the Dutch Privacy Authority

The interest in web scraping and its privacy implications is not limited to the Italian Authority. In May 2024, the Dutch Data Protection Authority took a position on web scraping by publishing guidelines for operators. Specifically, the Dutch Authority’s document explores the web scraping phenomenon, analyzing its legal implications and privacy risks under the GDPR.

The Dutch Authority highlights how web scraping, often employed for commercial or technological development purposes, inherently involves collecting large amounts of personal data, including special categories of data, potentially causing harm to fundamental rights and freedoms of individuals. The Authority emphasizes that, for compliance with GDPR, the processing of data collected through scraping must rely on a suitable legal basis – such as the Data Controller’s legitimate interest – whose applicability must be carefully evaluated through a balancing test with the fundamental rights of data subjects.

Particularly notable are the Authority’s recommendations on implementing the principle of transparency through comprehensive privacy notices and timely communication to data subjects. It is equally important to limit data processing and to comply with the principle of data minimization, which requires personal data to be adequate, relevant, and limited to what is necessary for the purposes for which they are collected and used. Additional suggested measures include pseudonymization, immediate removal of unnecessary data, and adherence to technical standards such as the robots.txt protocol.
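
For operators who run scrapers, the measures just described (respecting robots.txt, data minimization, and pseudonymization) can be sketched in Python as follows. This is an illustrative sketch only, not the Dutch Authority’s own method: the robots.txt content, the bot name `example-research-bot`, the field whitelist, and the record layout are hypothetical, and the keyed hash (HMAC-SHA256) is just one possible pseudonymization technique.

```python
import hashlib
import hmac
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""

# Hypothetical whitelist: the only fields the stated purpose actually needs.
ALLOWED_FIELDS = {"post_id", "published_at", "text"}


def may_fetch(url, user_agent="example-research-bot"):
    """Return True if the robots.txt rules above allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256), one possible technique."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record, secret_key):
    """Keep only whitelisted fields and pseudonymize the author; drop everything else."""
    kept = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    if "author" in record:
        kept["author_pseudonym"] = pseudonymize(record["author"], secret_key)
    return kept
```

In this sketch a page is requested only if `may_fetch()` returns `True`, and each collected record passes through `minimize()` before storage, so fields such as e-mail addresses are never retained. Pseudonymized data generally remains personal data under the GDPR for as long as re-identification is possible, so the other obligations discussed here continue to apply.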

The Authority also illustrates the limits of web scraping, emphasizing that in many situations, it may violate GDPR principles, especially when conducted without consent or for unjustifiable purposes, such as unauthorized profiling or online activity monitoring.

To ensure regulatory compliance, a Data Protection Impact Assessment (DPIA) is essential, particularly where scraping is carried out on a large scale or involves special categories of data.

Joint Declaration on Data Scraping

As noted, data scraping is undoubtedly a key issue in the debate on personal data protection. While this practice can be legitimately employed in certain contexts, its unauthorized use can result in significant privacy violations. In this context, Data Protection Authorities from 16 countries released a new joint declaration on data scraping and privacy protection, following up on their 2023 statement, which already stressed the urgency of protecting online platforms from unauthorized extraction and urged businesses to implement robust controls to block illicit activities.

This initiative is part of a broader global regulatory effort to govern and mitigate the risks associated with the automated extraction of personal data from digital platforms. The declaration, developed with contributions from Data Protection Authorities in various jurisdictions (including Canada, the United Kingdom, China, Switzerland, and Norway) and the active participation of leading social media operators (including Meta, LinkedIn, and X), provides critical operational guidance on protection against unauthorized web scraping.

The declaration highlights that companies must comply with data protection regulations when using information extracted through scraping techniques for AI model development or commercial purposes. It also suggests adopting dynamic and multi-level security measures to prevent unauthorized scraping.

Recommended measures include:

  • Use of CAPTCHA systems;
  • IP address blocking systems (also suggested by the Italian Privacy Authority);
  • Use of AI-based tools to identify and prevent anomalous behavior;
  • Designing platforms/websites with intrinsic mechanisms to hinder automated data extraction.
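
Two of the measures listed above, IP address blocking and the detection of anomalous behavior, can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions rather than a tool recommended by the declaration: the block list uses documentation-only address ranges, the function names and thresholds are hypothetical, and the anomaly check is a plain statistical heuristic standing in for the AI-based tools the declaration refers to.

```python
import ipaddress
from statistics import median

# Hypothetical block list (documentation-only address ranges, not real offenders).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip):
    """Return True if the client address falls inside a blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in BLOCKED_NETWORKS)


def flag_anomalous_clients(request_counts, factor=10.0, min_requests=100):
    """Flag clients whose request count is far above the typical (median) client.

    `request_counts` maps a client identifier to its number of requests in some
    observation period. `factor` and `min_requests` are assumed thresholds.
    """
    if not request_counts:
        return set()
    typical = median(request_counts.values())
    threshold = max(min_requests, factor * typical)
    return {client for client, count in request_counts.items() if count > threshold}
```

Flagged clients could then be reviewed, challenged with a CAPTCHA, or added to the block list, in line with the multi-level approach the declaration suggests.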

The joint declaration represents a significant step forward in international collaboration to address the challenges posed by web scraping. Technology sector companies must adopt a proactive approach to ensure regulatory compliance, protect users’ personal data, and implement advanced security measures. Collaboration between Data Protection Authorities and the private sector remains essential to mitigate the risks associated with these practices and promote a safer, more transparent digital ecosystem.

Conclusions

In conclusion, recent initiatives by Data Protection Authorities demonstrate that the web scraping phenomenon requires a coordinated response from businesses and regulators. The recommended strategy goes beyond reacting to existing violations and involves a proactive, multi-level approach that combines technological innovation, corporate policies, and regulatory compliance.
