Privacy

A Privacy Framework for Research Using Social Media Data

Social media data enables researchers to understand current events and human behavior with unprecedented ease and scale. Yet, researchers often violate user privacy when they access, process, and store sensitive information contained within social media data.

Social media has proved largely beneficial for research. Studies measuring the spread of COVID-19 and predicting crime highlight the valuable insights that are not possible without the Big Data present on social media. Lurking within these datasets, though, are intimate personal stories related to sensitive topics such as sexual abuse, pregnancy loss, and gender transition— both anonymously and in identifying ways.

In our latest paper presented at the 2025 IEEE Symposium on Security and Privacy, “SoK: A Privacy Framework for Security Research Using Social Media Data,” we examine the tensions in social media research between pursuing better science enabled by social media data and protecting social media users’ privacy. We focus our efforts on security and privacy research as it can involve sensitive topics such as misinformation, harassment, and abuse as well as the need for security researchers to be held to a higher standard when revealing vulnerabilities that impact companies and users.

Methodology

Toward this end, we conducted a systematic literature review of security literature on social media. We collected over 10,000 papers from six different disciplines, Computer Security and Cryptography (CSC), Data Mining and Analysis (DMA), Human-Computer Interaction (HCI), Humanities, Literature & Arts, Communication, (HLAC), Social Sciences, Criminology (SSC), and Social Sciences, Forensic Science (SSFS). Our final dataset included 601 papers across 16 years after iterating through several screening rounds.

How do security researchers handle privacy of social media data?

Our most alarming finding is that only 35% of papers mention any considerations of data anonymization, availability, and storage. This means that security and privacy researchers are failing to report how they handle user privacy.

Continue reading A Privacy Framework for Research Using Social Media Data

A well-executed exercise in snake oil evaluation

In the umpteenth chapter of UK governments battling encryption, Priti Patel in September 2021 launched the “Safety Tech Challenge”. It was to give five companies £85K each to develop “innovative technologies to keep children safe when using end-to-end encrypted messaging services”. Tasked with evaluating the outcomes was the REPHRAIN project, the consortium given £7M to address online harms. I had been part of the UKRI 2020 panel awarding this grant, and believed then and now that it concerns a politically laden and technically difficult task, that was handed to a group of eminently sensible scientists.¹ While the call had strongly invited teams to promise the impossible in order to placate the political goals, this team (and some other consortia too) wisely declined to do so, and remained realistic.

The evaluation results have now come back, and the REPHRAIN team have done a very decent job given that they had to evaluate five different brands of snake oil with their hands tied behind their backs. In doing so, they have made a valuable contribution to the development of trustworthy AI in the important application area of online (child) safety technology.

The Safety Tech Challenge

The Safety Tech Challenge was always intellectually dishonest. The essence of end-to-end encryption (E2EE) is that nothing² can be known about encrypted information by anyone other than the sender and receiver. Not whether the last bit is a 0, not whether the message is CSAM (child sexual abuse material).³ The final REPHRAIN report indeed states there is “no published research on computational tools that can prevent CSAM in E2EE”.

In terms of technologies, there really also is no such thing as “in the context of E2EE”: the messages are agnostic as to whether they are about to be encrypted (on the sender side) or have just been decrypted (on the receiving side), and nothing meaningful can be done⁴ in between; any technologies that can be developed are agnostic of when they get invoked.

Continue reading A well-executed exercise in snake oil evaluation

What is Synthetic Data? The Good, the Bad, and the Ugly

Sharing data can often enable compelling applications and analytics. However, more often than not, valuable datasets contain information of sensitive nature, and thus sharing them can endanger the privacy of users and organizations.

A possible alternative gaining momentum in the research community is to share synthetic data instead. The idea is to release artificially generated datasets that resemble the actual data — more precisely, having similar statistical properties.

So how do you generate synthetic data? What is that useful for? What are the benefits and the risks? What are the fundamental limitations and the open research questions that remain unanswered?

All right, let’s go!

How To Safely Release Data?

Before discussing synthetic data, let’s first consider the “alternatives.”

Anonymization: Theoretically, one could remove personally identifiable information before sharing it. However, in practice, anonymization fails to provide realistic privacy guarantees because a malevolent actor often has auxiliary information that allows them to re-identify anonymized data. For example, when Netflix de-identified movie rankings (as part of a challenge seeking better recommendation systems), Arvind Narayanan and Vitaly Shmatikov de-anonymized a large chunk by cross-referencing them with public information on IMDb.

Continue reading What is Synthetic Data? The Good, the Bad, and the Ugly

“I am yet to meet a young person that has not experienced some form of abuse via tech”

Technology-facilitated abuse describes the misuse of digital systems such as smartphones or other Internet-connected devices to monitor, control and harm individuals. In recent years increasing attention has been given to this phenomenon in school settings and the criminal justice system. Yet, an awareness in the healthcare sector is lacking. To address this gap, Dr Isabel Straw and Dr Leonie Tanczer from University College London (UCL) have been leading a new research project that examines technology-facilitated abuse in medical settings.

Technology-facilitated forms of abuse are on the rise, with perpetrators adapting digital technologies such as smartphones and drones, trackers such as AirTags, and spyware tools including parental control software, to cause harm. The impact of technology-facilitated abuse on patients may not always be immediately obvious to healthcare professionals. For instance, smart, Internet-connected devices have been showcased to be misused in domestic abuse cases to inflict physical harm. Smart locks have been used to trap individuals inside their homes, smart thermostats have been used to inflict extremes of temperature on victims, and remotely controlled lighting and sound systems have been manipulated to cause psychological distress. COVID-19 catalyzed the proliferation of these technologies within our environment, with sales of smart devices increasing 30% on last year. Yet, while these tools are advertised for their proposed safety and convenience, they are also providing new avenues for violence, harassment, and abuse.

The impact of technology-facilitated abuse is especially notable on young people. In recent years, pediatric safeguarding guidelines have been amended in response to increasing rates of knife crime, gang violence and drug trafficking in the UK. However, technology-facilitated abuse has evolved at a parallel rate and has not received the same level of attention. The impact of technology-facilitated abuse on children and teenagers may manifest as emotional distress, anxiety, suicidal ideation. Koubel reports the exacerbation of mental health risks born from websites that encourage self-harm, eating disorders, and suicide. Furthermore, technology-facilitated dating abuse and sextortion is increasing amongst adolescent populations. With 10% of children being affected by sexual solicitation online, the problem is widespread and under-investigated. As reported by Stonard et al. in “They’ll Always Find a Way to Get to You”, digital devices are playing an increasing role in relationship abuse amongst young people.

Vulnerable individuals frequently perceive medical settings as a place of safety. Healthcare professionals, thus, have a role in providing both medical and psychosocial care to ensure their wellbeing. At present, existing clinical and patient management protocols are outdated and do not address the emerging threats of technology-facilitated abuse. For clinicians to provide effective care to patients affected by technological elements of abuse and violence, clinical safeguarding protocols need a radical update if they are to assist professionals navigating high risk scenarios.

Continue reading “I am yet to meet a young person that has not experienced some form of abuse via tech”

Pre-loading HSTS for sibling domains through this one weird trick

The vast majority of websites now support encrypted connections over HTTPS. This prevents eavesdroppers from monitoring or tampering with people’s web activity and is great for privacy. However, HTTPS is optional, and all browsers still support plain unsecured HTTP for when a website doesn’t support encryption. HTTP is commonly the default, and even when it’s not, there’s often no warning when access to a site falls back to using HTTP.

The optional nature of HTTPS is its weakness and can be exploited through tools, like sslstrip, which force browsers to fall back to HTTP, allowing the attacker to eavesdrop or tamper with the connection. In response to this weakness, HTTP Strict Transport Security (HSTS) was created. HSTS allows a website to tell the browser that only HTTPS should be used in future. As long as someone visits an HSTS-enabled website one time over a trustworthy Internet connection, their browser will refuse any attempt to fall back to HTTP. If that person then uses a malicious Internet connection, the worst that can happen is access to that website will be blocked; tampering and eavesdropping are prevented.

Still, someone needs to visit the website once before an HSTS setting is recorded, leaving a window of opportunity for an attacker. The sooner a website can get its HSTS setting recorded, the better. One aspect of HSTS that helps is that a website can indicate that not only should it be HSTS enabled, but that all subdomains are too. For example, planet.wikimedia.org can say that the subdomain en.planet.wikimedia.org is HSTS enabled. However, planet.wikimedia.org can’t say that commons.wikimedia.org is HSTS enabled because they are sibling domains. As a result, someone would need to visit both commons.wikimedia.org and planet.wikimedia.org before both websites would be protected.

What if HSTS could be applied to sibling domains and not just subdomains? That would allow one domain to protect accesses to another. The HSTS specification explicitly excludes this feature, for a good reason: discovering whether two sibling domains are run by the same organisation is fraught with difficulty. However, it turns out there’s a way to “trick” browsers into pre-loading HSTS status for sibling domains.

Continue reading Pre-loading HSTS for sibling domains through this one weird trick

Apple letting the content-scanning genie out of the bottle

When Apple announced that they would be scanning iPhones for child sexual abuse material (CSAM), the push-back appears to have taken them by surprise. Since then, Apple has been engaging with experts and developing their proposals to mitigate risks that have been raised. In this post, I’ll discuss some of the issues with Apple’s CSAM detection system and what I’ve learned from their documentation and events I’ve participated in.

Technically Apple’s CSAM detection proposal is impressive, and I’m pleased to see Apple listening to the community to address issues raised. However, the system still creates risks that will be difficult to avoid. Governments are likely to ask to expand the system to types of content other than CSAM, regardless of what Apple would like to happen. When they do, there will be complex issues to deal with, both for Apple and the broader technology community. The proposals also risk causing people to self-censor, even when they are doing nothing wrong.

How Apple’s CSAM detection works

The iPhone or iPad scans images for known CSAM just before it uploads the image to Apple’s cloud data storage system – iCloud. Images that are not going to be uploaded don’t get scanned. The comparison between images and the database is made in such a way that minor changes to CSAM, like resizing and cropping, will trigger a match, but any image that wasn’t derived from a known item of CSAM should be very unlikely to match. The results of this matching process go into a clever cryptographic system designed to ensure that the user’s device doesn’t learn the contents of the CSAM database or which of their images (if any) match. If more than a threshold of about 30 images match, Apple will be able to verify if the matching images are CSAM and, if so, report to the authorities. If the number of matching images is less than the threshold, Apple learns nothing.

Risk of scope creep

Now that Apple has built their system, a risk is that it could be extended to search for content other than CSAM by expanding the database used for matching. While some security properties of their system are ensured through cryptography, the restriction to CSAM is only a result of Apple’s policy on the content of the matching database. Apple has clearly stated that it would resist any expansion of this policy, but governments may force Apple to make changes. For example, in the UK, this could be through a Technical Capability Notice (under the Investigatory Powers Act) or powers proposed in the Online Safety Bill.

If a government legally compelled them to expand the matching database, Apple may have to choose between complying or leaving the market. So far, Apple has refused to say which of these choices they would take.

Continue reading Apple letting the content-scanning genie out of the bottle

Measuring mobility without violating privacy – a case study of the London Underground

In the run-up to this year’s Privacy Enhancing Technologies Symposium (PETS 2019), I noticed some decidedly non-privacy-enhancing behaviour. Transport for London (TfL) announced they will be tracking the wifi MAC addresses of devices being carried on London Underground stations. Before storing a MAC address it will be hashed with a key, but since this key will remain unchanged for an extended period (2 years), it will be possible to track the movements of an individual over this period through this pseudonymous ID. These traces are likely enough to link records back to the individual with some knowledge of that person’s distinctive travel plans. Also, for as long as the key is retained it would be trivial for TfL (or someone who stole the key) to convert the someone’s MAC address into its pseudonymised form and indisputably learn that that person’s movements.

TfL argues that under the General Data Protection Regulations (GDPR), they don’t need the consent of individuals they monitor because they are acting in the public interest. Indeed, others have pointed out the value to society of knowing how people typically move through underground stations. But the GDPR also requires that organisations minimise the amount of personal data they collect. Could the same goal be achieved if TfL irreversibly anonymised wifi MAC addresses rather than just pseudonymising them? For example, they could truncate the hashed MAC address so that many devices all have the same truncated anonymous ID. How would this affect the calculation of statistics of movement patterns within underground stations? I posed these questions in a presentation at the PETS 2019 rump session, and in this article, I’ll explain why a set of algorithms designed to violate people’s privacy can be applied to collect wifi mobility information while protecting passenger privacy.

It’s important to emphasise that TfL’s goal is not to track past Underground customers but to predict the behaviour of future passengers. Inferring past behaviours from the traces of wifi records may be one means to this end, but it is not the end in itself, and TfL creates legal risk for itself by holding this data. The inferences from this approach aren’t even going to be correct: wifi users are unlikely to be typical passengers and behaviour will change over time. TfL’s hope is the inferred profiles will be useful enough to inform business decisions. Privacy-preserving measurement techniques should be judged by the business value of the passenger models they create, not against how accurate they are at following individual passengers around underground stations in the past. As the saying goes, “all models are wrong, but some are useful”.

Simulating privacy-preserving mobility measurement

To explore this space, I built a simple simulation of Euston Station inspired by one of the TfL case studies. In my simulation, there are two platforms (A and B) and six types of passengers. Some travel from platform A to B; some from B to A; others enter and leave the station at one platform (A or B). Of the passengers that travel between platforms, they can take either the fast route (taking 2 minutes on average) or the slow route (taking 4 minutes on average). Passengers enter the station at a Poisson arrival rate averaging one per second. The probabilities that each new passenger is of a particular type are shown in the figure below. The goal of the simulation is to infer the number of passengers of each type from observations of wifi measurements taken at platforms A and B.

Continue reading Measuring mobility without violating privacy – a case study of the London Underground

Tracing transactions across cryptocurrency ledgers

The Bitcoin whitepaper specifies the risks of revealing owners of addresses. It states that “if the owner of a key is revealed, linking could reveal other transactions that belonged to the same owner.” Five years later, we have seen many projects which look at de-anonymising entities in Bitcoin. Such projects use techniques such as address tagging and clustering to tie many addresses to one entity, making it easier to analyse the movement of funds. However, this is not only limited to Bitcoin but also occurs on alternative cryptocurrencies such as Zcash and Monero. Thus tracing transactions on-chain is a known and studied problem.

But we have recently seen a shift into entities performing cross-currency trades. For example, the WannaCry hackers laundered over $142,000 Bitcoin from ransoms across cryptocurrencies. The issue here is that cross-chain transactions appear to be indistinguishable from native transactions on-chain. For example, to trade Bitcoin for Monero, one would have to send the exchange bitcoin, and in return, the exchange sends the user some coins in Monero. Both these transactions occur on separate chains and do not appear to be connected, so the actual swap can appear to be obscured. This level of obscurity can be used to hide the original flow of coins, giving users an additional form of anonymity.

Thus it is important to ask whether or not we can analyse such transactions and the extent of the analysis possible, and if so, how? In our paper being presented today at the USENIX Security Symposium, we (Haaroon Yousaf, George Kappos and Sarah Meiklejohn) answer these questions.

Our Research

In summary, we scraped and linked over 1.3 million transactions across different blockchains from the service ShapeShift. In doing so, we found over 100,000 cases where users would convert coins to another currency then move right back to the original one, identified that a Bitcoin address associated with CoinPayments.net address is a very popular service for users to shift to, and saw that scammers preferred shifting their Ethereum to Bitcoin and Monero.

We collected and analysed 13 months of transaction data across eight different blockchains to identify how users interacted with this service. In doing so, we developed new heuristics and identified various patterns of cross-currency trades.

What is ShapeShift?

ShapeShift is a lightweight cross-currency non-custodial service that facilitates trades which allows users to directly trade coins from one currency to another (a cross-currency shift). This service acts as the entity which facilitates the entire trade, allowing users to essentially swap their coins with its own supply. ShapeShift and Changelly are examples of such services.

Continue reading Tracing transactions across cryptocurrency ledgers

Thoughts on the Libra blockchain: too centralised, not private, and won’t help the unbanked

Facebook recently announced a new project, Libra, whose mission is to be “a simple global currency and financial infrastructure that empowers billions of people”. The announcement has predictably been met with scepticism by organisations like Privacy International, regulators in the U.S. and Europe, and the media at large. This is wholly justified given the look of the project’s website, which features claims of poverty reduction, job creation, and more generally empowering billions of people, wrapped in a dubious marketing package.

To start off, there is the (at least for now) permissioned aspect of the system. One appealing aspect of cryptocurrencies is their potential for decentralisation and censorship resistance. It wasn’t uncommon to see the story of PayPal freezing Wikileak’s account in the first few slides of a cryptocurrency talk motivating its purpose. Now, PayPal and other well-known providers of payment services are the ones operating nodes in Libra.

There is some valid criticism to be made about the permissioned aspect of a system that describes itself as a public good when other cryptocurrencies are permissionless. These are essentially centralised, however, with inefficient energy wasting mechanisms like Proof-of-Work requiring large investments for any party wishing to contribute.

There is a roadmap towards decentralisation, but it is vague. Achieving decentralisation, whether at the network or governance level, hasn’t been done even in a priori decentralised cryptocurrencies. In this sense, Libra hasn’t really done worse so far. It already involves more members than there are important Bitcoin or Ethereum miners, for example, and they are also more diverse. However, this is more of a fault in existing cryptocurrencies rather than a quality of Libra.

Continue reading Thoughts on the Libra blockchain: too centralised, not private, and won’t help the unbanked

How Accidental Data Breaches can be Facilitated by Windows 10 and macOS Mojave

Inadequate user interface designs in Windows 10 and macOS Mojave can cause accidental data breaches through inconsistent language, insecure default options, and unclear or incomprehensible information. Users could accidentally leak sensitive personal data. Data controllers in companies might be unknowingly non-compliant with the GDPR’s legal obligations for data erasure.

At the upcoming Annual Privacy Forum 2019 in Rome, I will be presenting the results of a recent study conducted with my colleague Mark Warner, exploring the inadequate design of user interfaces (UI) as a contributing factor in accidental data breaches from USB memory sticks. The paper titled “Fight to be Forgotten: Exploring the Efficacy of Data Erasure in Popular Operating Systems” will be published in the conference proceedings at a later date but the accepted version is available now.

Privacy and security risks from decommissioned memory chips

The process of decommissioning memory chips (e.g. USB sticks, hard drives, and memory cards) can create risks for data protection. Researchers have repeatedly found sensitive data on devices they acquired from second-hand markets. Sometimes this data was from the previous owners, other times from third persons. In some cases, highly sensitive data from vulnerable people were found, e.g. Jones et al. found videos of children at a high school in the UK on a second-hand USB stick.

Data found this way had frequently been deleted but not erased, creating the risk that any tech-savvy future owner could access it using legally available, free to download software (e.g., FTK Imager Lite 3.4.3.3). Findings from these studies also indicate the previous owners’ intentions to erase these files and prevent future access by unauthorised individuals, and their failure to sufficiently do so. Moreover, these risks likely extend from the second-hand market to recycled memory chips – a practice encouraged under Directive 2012/19/EU on ‘waste electrical and electronic equipment’.

The implications for data security and data protection are substantial. End-users and companies alike could accidentally cause breaches of sensitive personal data of themselves or their customers. The protection of personal data is enshrined in Article 8 of the Charter of Fundamental Rights of the European Union, and the General Data Protection Regulation (GDPR) lays down rules and regulation for the protection of this fundamental right. For example, data processors could find themselves inadvertently in violation of Article 17 GDPR Right to Erasure (‘right to be forgotten’) despite their best intentions if they failed to erase a customer’s personal data – independent of whether that data was breached or not.

Seemingly minor design choices, the potential for major implications

The indication that people might fail to properly erase files from storage, despite their apparent intention to do so, is a strong sign of system failure. We know since more than twenty years that unintentional failure of users at a task “is often caused by the way in which [these] mechanisms are implemented, and users’ lack of knowledge”. In our case, these mechanisms are – for most users – the UI of Windows and macOS. When investigating these mechanisms, we found seemingly minor design choices that might facilitate unintentional data breaches. A few examples are shown below and are expanded upon in the full publication of our work.

Continue reading How Accidental Data Breaches can be Facilitated by Windows 10 and macOS Mojave