web scraper Archives - Data v. Privacy

Hot on the heels of more news of Facebook’s data leak affecting 533 million users, we continue to hear about a similarly sized “breach” of LinkedIn data that affects another 500 million or so users. Reports of both incidents throw the word “breach” around, even though the data appears to have been (at least somewhat) publicly accessible.

Are these data events actually breaches? Should they be considered breaches?

Facebook’s 533 Million User Records

According to Facebook, at some point prior to September 2019, someone scraped over 530 million Facebook users’ “publicly available data” using Facebook’s own contact importer tool.

At the time, you could upload phone numbers en masse to see which ones matched existing Facebook users. This vulnerability is what Facebook believes led to the “public access” of its users’ data.
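
To see why a bulk contact-matching feature can be abused, here is a toy simulation. It does not reflect Facebook’s actual API: the directory, the `match_contacts` and `enumerate_range` functions, and the phone numbers (drawn from the fictional 555-01xx range) are all hypothetical and live entirely in memory.

```python
# Toy model of a contact importer. This is NOT Facebook's real API.
# All names, numbers, and profiles below are made up for illustration.

# Hypothetical user directory keyed by phone number.
DIRECTORY = {
    "+12025550101": {"name": "Alice Example", "city": "Austin"},
    "+12025550107": {"name": "Bob Sample", "city": "Denver"},
}


def match_contacts(phone_numbers):
    """Return the public profile for every uploaded number that has an account."""
    return {n: DIRECTORY[n] for n in phone_numbers if n in DIRECTORY}


def enumerate_range(prefix, start, stop):
    """Generate every number in a block, standing in for a bulk 'address book'."""
    return [f"{prefix}{i:04d}" for i in range(start, stop)]


# The uploader knows none of these people: it simply submits a whole block.
guesses = enumerate_range("+1202555", 100, 110)
matches = match_contacts(guesses)

for number, profile in matches.items():
    print(number, "->", profile["name"])
```

Each profile field in this sketch might be public on its own, but the bulk lookup ties it to a phone number the uploader never actually knew. That linkage is the part that sits uneasily with the “it was all public” framing.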

Facebook is adamant that the data was public and, thus, fair game for scrapers to access. Private information was not scraped. Accordingly, Facebook’s position is that there was no breach. However, this explanation doesn’t necessarily match other news reports.

LinkedIn’s 500 Million User Database

As for the LinkedIn data breach/leak, all reports so far seem to confirm the same factual scenario: data of some 500 million users was scraped from publicly accessible pages on LinkedIn.com.

On April 6, 2021, the LinkedIn dataset was then spotted up for sale on a hacker forum for a “4 digit minimum price.”

Multiple sources confirmed a sample set of the data was legit and then the “LinkedIn Data Breach!” articles started making their rounds.

LinkedIn took the same public position as Facebook did with its phone-number data incident — this was no “data breach.”

What is Data Scraping?

Depending on who you ask, data scraping is either a simple tool to efficiently aggregate data that’s on the web or a nefarious process by which cybercriminals steal data in violation of a website’s terms of service. In many cases, it’s a gray area somewhere in between.

Commonly, a program is used to automatically load (and even interact with) a webpage and gather data from the page. That data is then collected in a file or database for further processing or distribution.
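
As a minimal sketch of that load, gather, and collect flow, the example below parses a hard-coded page with Python’s standard-library `html.parser` and writes the results as CSV. The page markup, the class names, and the `ProfileParser` class are assumptions made up for illustration; a real scraper would fetch the page over HTTP and deal with far messier markup.

```python
import csv
import io
from html.parser import HTMLParser

# Stand-in for a downloaded page; a real scraper would fetch this over HTTP.
PAGE = """
<html><body>
  <div class="profile"><span class="name">Alice Example</span><span class="title">Engineer</span></div>
  <div class="profile"><span class="name">Bob Sample</span><span class="title">Analyst</span></div>
</body></html>
"""


class ProfileParser(HTMLParser):
    """Collect the text of the name and title spans inside each profile div."""

    FIELDS = {"name", "title"}

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per profile
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "profile":
            self.rows.append({})
        elif tag == "span" and css_class in self.FIELDS:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that appears inside a span we are tracking.
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


parser = ProfileParser()
parser.feed(PAGE)

# Collect the gathered rows as CSV text for further processing.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "title"])
writer.writeheader()
writer.writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```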

There are plenty of ways to scrape data, ranging from relatively simple Python scripts using Beautiful Soup or Selenium to a vast array of commercial SaaS applications like Web Scraper or ParseHub. These tools allow the user to target specific portions of web pages and then parse the data to fit their needs.
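
The “target, then parse” idea can be sketched without any third-party tool. The example below swaps in Python’s standard-library `xml.etree.ElementTree` in place of Beautiful Soup or Selenium, so it only works because the snippet is well-formed. The table, its class names, and the `parse_price` helper are hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical, well-formed snippet. Real-world HTML is often messier and
# usually needs a forgiving parser such as Beautiful Soup.
SNIPPET = """
<table>
  <tr class="listing"><td class="item">Laptop</td><td class="price">$1,299.00</td></tr>
  <tr class="listing"><td class="item">Monitor</td><td class="price">$249.50</td></tr>
  <tr class="footer"><td>Prices shown in USD</td><td></td></tr>
</table>
"""


def parse_price(text):
    """Turn a display string like '$1,299.00' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


root = ET.fromstring(SNIPPET)

listings = []
# Target only rows marked class="listing"; the footer row is skipped.
for row in root.findall(".//tr[@class='listing']"):
    item = row.find("td[@class='item']").text
    price = parse_price(row.find("td[@class='price']").text)
    listings.append((item, price))

total = sum(price for _, price in listings)
print(listings, total)
```

The selection step picks out the part of the page that matters, and the parsing step converts display text into a value that can be summed, sorted, or stored.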

Is Data Scraping Legal?

Again, it depends on who you ask. However, public-facing websites in the United States that want to stop scrapers face an uphill battle in light of the recent Ninth Circuit Court of Appeals ruling in hiQ Labs, Inc. v. LinkedIn Corp., No. 17-16783 (9th Cir. 2019).

hiQ Labs v. LinkedIn: A Data Scraping Case for the Ages

In hiQ Labs v. LinkedIn, the Ninth Circuit addressed a dispute between the two companies over hiQ’s business model, which involved scraping LinkedIn’s public-facing website. If this is an area relevant to you, it’s a great case to read, but the facts and analysis are a bit lengthy, weighing in at 38 pages for the full opinion.

In short, the dispute appears to have surfaced when LinkedIn started offering a product that competed with what hiQ was providing to its customers. Until then, the parties appeared to cooperate somewhat. LinkedIn threatened to sue hiQ if it didn’t stop scraping data, a practice that technically violated LinkedIn’s terms of service.

hiQ then sued LinkedIn, seeking an injunction that would prevent LinkedIn from blocking hiQ from its website. Additionally, hiQ asked for a declaratory judgment that LinkedIn couldn’t invoke the Computer Fraud and Abuse Act (CFAA), which it had previously threatened in a cease and desist letter.

The federal district court granted a preliminary injunction in hiQ’s favor, which the Ninth Circuit affirmed. The case is now awaiting a certiorari decision from the US Supreme Court.

Whatever the outcome, this fight is far from over as the battle is presently only focused on whether a preliminary injunction is appropriate. However, the Ninth Circuit made some important and relevant statements surrounding the legality of data scraping on public websites.

hiQ’s business interests outweigh the privacy interests of LinkedIn’s users.
LinkedIn’s act of barring hiQ’s access may amount to tortious interference with the contracts between hiQ and its customers.
“It is likely that when a computer network generally permits public access to its data, a user’s accessing that publicly available data will not constitute access without authorization under the CFAA.”

The Ninth Circuit did give LinkedIn a tip – even if the CFAA is unlikely to apply – that “state law trespass to chattels claims may still be available.”

Might a state law trespass to chattels claim be the only way to prevent continuous scraping of your web servers sitting in your own data center?

hiQ’s Business Interest vs. Dark Web Data Broker’s “4 digit minimum price”

The Ninth Circuit’s bright-line characterization of public data access makes it tough to tackle a nefarious data scraper with a CFAA charge, as with the aggregated LinkedIn data that’s currently up for sale. It’s also consistent with LinkedIn’s position that the scrape was not a data breach.

Unless the Supreme Court reverses this line of holdings, the CFAA is going to be useless against bad-actor data scrapers. (As a quick aside, we’ll get a sneak peek at the current Supreme Court’s feelings on the CFAA’s meaning of “without authorization” in the upcoming Van Buren v. United States ruling, which was argued before the Court back in November 2020.)

Turning the CFAA into a “terms of service violation = federal felony” law seems like it would open a can of worms with disastrous results. The data scraping balancing act likely needs to fall somewhere in the gray area between welcome public service uses (e.g., COVID-19 data research and aggregation) and aggregating personal data for sale on the dark web.

In the US, it seems like a comprehensive federal privacy law might be able to address a matter like this. Important questions for Congress to consider when addressing these data scraping issues are:

determining where the line is drawn on the good-actor/bad-actor spectrum of scrapers
determining how to balance the chilling effects on First Amendment access to publicly accessible speech (as hiQ notes in its certiorari opposition brief), and
determining whether and when the disclosure of aggregated personal data that is otherwise individually available on a public-facing website becomes a data breach (and who has the responsibility of disclosure to those affected)