On Location, Time, and Membership: Studying How Aggregate Location Data Can Harm Users’ Privacy

The increasing availability of location and mobility data enables a number of applications, e.g., enhanced navigation services and parking, context-based recommendations, or waiting time predictions at restaurants, which have great potential to improve the quality of life in modern cities. However, the large-scale collection of location data also raises privacy concerns, as mobility patterns may reveal sensitive attributes about users, e.g., home and work places, lifestyles, or even political or religious inclinations.

Service providers, i.e., companies with access to location data, often use aggregation as a privacy-preserving strategy to release these data to third-parties for various analytics tasks. The idea being that, by grouping together users’ traces, the data no longer contains information to enable inferences about individuals such as the ones mentioned above, while it can be used to obtain useful insights about the crowds. For instance, Waze constructs aggregate traffic models to improve navigation within cities, while Uber provides aggregate data for urban planning purposes. Similarly, CityMapper’s Smart Ride application aims at identifying gaps in transportation networks based on traces and rankings collected by users’ mobile devices, while Telefonica monetizes aggregate location statistics through advertising as part of the Smart Steps project.

That’s great, right? Well, our paper, “Knock Knock, Who’s There? Membership Inference on Aggregate Location Data” and published at NDSS 2018, shows that aggregate location time-series can in fact be used to infer information about individual users. In particular, we demonstrate that aggregate locations are prone to a privacy attack, known as membership inference: a malicious entity aims at identifying whether a specific individual contributed her data to the aggregation. We demonstrate the feasibility of this type of privacy attack on a proof-of-concept setting designed for an academic evaluation, and on a real-world setting where we apply membership inference attacks in the context of the Aircloak challenge, the first bounty program for anonymized data re-identification.

Membership Inference Attacks on Aggregate Location Time-Series

Our NDSS’18 paper studies membership inference attacks on aggregate location time-series indicating the number of people transiting in a certain area at a given time. That is, we show that an adversary with some “prior knowledge” about users’ movements is able to train a machine learning classifier and use it to infer the presence of a specific individual’s data in the aggregates.

We experiment with different types of prior knowledge. On the one hand, we simulate powerful adversaries that know the real locations for a subset of users in a database during the aggregation period (e.g., telco providers which have location information about their clients), while on the other hand, weaker ones which only know past statistics about user groups (i.e., reproducing a setting of continuous data release). Overall, we find that the adversarial prior knowledge influences significantly the effectiveness of the attack.

Continue reading On Location, Time, and Membership: Studying How Aggregate Location Data Can Harm Users’ Privacy

Improving the auditability of access to data requests

Data is increasingly collected and shared, with potential benefits for both individuals and society as a whole, but people cannot always be confident that their data will be shared and used appropriately. Decisions made with the help of sensitive data can greatly affect lives, so there is a need for ways to hold data processors accountable. This requires not only ways to audit these data processors, but also ways to verify that the reported results of an audit are accurate, while protecting the privacy of individuals whose data is involved.

We (Alexander Hicks, Vasilios Mavroudis, Mustafa Al-Basam, Sarah Meiklejohn and Steven Murdoch) present a system, VAMS, that allows individuals to check accesses to their sensitive personal data, enables auditors to detect violations of policy, and allows publicly verifiable and privacy-preserving statistics to be published. VAMS has been implemented twice, as a permissioned distributed ledger using Hyperledger Fabric and as a verifiable log-backed map using Trillian. The paper and the code are available.

Use cases and setting

Our work is motivated by two scenarios: controlling the access of law-enforcement personnel to communication records and controlling the access of healthcare professionals to medical data.

The UK Home Office states that 95% of serious and organized criminal cases make use of communications data. Annual reports published by the IOCCO (now under the IPCO name) provide some information about the request and use of communications data. There were over 750 000 requests for data in 2016, a portion of which were audited to provide the usage statistics and errors that can be found in the published report.

Not only is it important that requests are auditable, the requested data can also be used as evidence in legal proceedings. In this case, it is necessary to ensure the integrity of the data or to rely on representatives of data providers and expert witnesses, the latter being more expensive and requiring trust in third parties.

In the healthcare case, individuals usually consent for their GP or any medical professional they interact with to have access to relevant medical records, but may have concerns about the way their information is then used or shared.  The NHS regularly shares data with researchers or companies like DeepMind, sometimes in ways that may reduce the trust levels of individuals, despite the potential benefits to healthcare.

Continue reading Improving the auditability of access to data requests

“The pool’s run dry” – analyzing anonymity in Zcash

Zcash is a cryptocurrency whose main feature is a “shielded pool” that is designed to provide strong anonymity guarantees. Indeed, the cryptographic foundations of the shielded pool are based in highly-regarded academic research. The deployed Zcash protocol, however, allows for transactions outside of the shielded pool (which, from an anonymity perspective, are identical to Bitcoin transactions), and it can be easily observed from blockchain data that the majority of transactions do not use the pool. Nevertheless, users of the shielded pool should be able to treat it as their anonymity set when attempting to spend coins in an anonymous fashion.

In a recent paper, An Empirical Analysis of Anonymity in Zcash, we (George Kappos, Haaroon Yousaf, Mary Maller, and Sarah Meiklejohn) conducted an empirical analysis of Zcash to further our understanding of its shielded pool and broader ecosystem. Our main finding is that is possible in many cases to identify the activity of founders and miners using the shielded pool (who are required by the consensus rules to put all newly generated coins into it). The implication for anonymity is that this activity can be excluded from any attempt to track coins as they move through the pool, which acts to significantly shrink the effective anonymity set for regular users. We have disclosed all our findings to the developers of Zcash, who have written their own blog post about this research.  This work will be presented at the upcoming USENIX Security Symposium.

What is Zcash?

In Bitcoin, the sender(s) and receiver(s) in a transaction are publicly revealed on the blockchain. As with Bitcoin, Zcash has transparent addresses (t-addresses) but gives users the option to hide the details of their transactions using private addresses (z-addresses). Private transactions are conducted using the shielded pool and allow users to spend coins without revealing the amount and the sender or receiver. This is possible due to the use of zero-knowledge proofs.

Like Bitcoin, new coins are created in public “coingen” transactions within new blocks, which reward the miners of those blocks. In Zcash, a percentage of the newly minted coins are also sent to the founders (a predetermined list of Zcash addresses owned by the developers and embedded into the protocol).

Continue reading “The pool’s run dry” – analyzing anonymity in Zcash

“Wow such genetics. So data. Very forever?” – An overview of the blockchain genomics trend

In 2014, Harvard professor and geneticist George Church said: “‘Preserving your genetic material indefinitely’ is an interesting claim. The record for storage of non-living DNA is now 700,000 years (as DNA bits, not electronic bits). So, ironically, the best way to preserve your electronic bitcoins/blockchains might be to convert them into DNA”. In early February 2018, Nebula Genomics, a blockchain-enabled genomic data sharing and analysis platform, co-founded by George Church, was launched. And they are not alone on the market. The common factor between all of them is that they want to give the power back to the user. By leveraging the fact that most companies that currently offer direct-to-consumer genetic testing sell data collected from their customers to pharmaceutical and biotech companies for research purposes, they want to be the next Uber or Airbnb, with some even claiming to create the Alibaba for life data using the next-generation artificial intelligence and blockchain technologies.

Nebula Genomics

Its launch is motivated by the need of increasing genomic data sharing for research purposes, as well as reducing the costs of sequencing on the client side. The Nebula model aims to eliminate personal genomics companies as the middle-man between the customer and the pharmaceutical companies. This way, data owners can acquire their personal genomic data from Nebula sequencing facilities or other sources, join the Nebula network and connect directly with the buyers.
Their main claims from their whitepaper can be summarized as follows:

  • Lower the sequencing costs for customers by joining the network to profiting from directly by connecting with data buyers if they had their genomes sequenced already, or by participating in paid surveys, which can incentivize data buyers to subsidize their sequencing costs
  • Enhanced data protection: shared data is encrypted and securely analyzed using Intel Software Guard Extensions (SGX) and partially homomorphic encryption (such as the Paillier scheme)
  • Efficient data acquisition, enabling data buyers to efficiently acquire large genomic datasets
  • Being big data ready, by allowing data owners to privately store their data, and introducing space efficient data encoding formats that enable rapid transfers of genomic data summaries over the network

Zenome

This project aims to ensure that genomic data from as many people as possible will be openly available to stimulate new research and development in the genomics industry. The founders of the project believe that if we do not provide open access to genomic data and information exchange, we are at risk of ending up with thousands of isolated, privately stored collections of genomic data (from pharmaceutical companies, genomic corporations, and scientific centers), but each of these separate databases will not contain sufficient data to enable breakthrough discoveries. Their claims are not as ambitious as Nebula, focusing more on the customer profiting from selling their own DNA data rather than other sequencing companies. Their whitepaper even highlights that no valid solutions currently exist for the public use of genomic information while maintain individual privacy and that encryption is used when necessary. When buying ZNA tokens (the cryptocurrency associated with Zenome), one has to follow a Know-Your-Customer procedure and upload their ID/Passport.

Gene Blockchain

The Gene blockchain business model states it will use blockchain smart contracts to:

  • Create an immutable ledger for all industry related data via GeneChain
  • Offer payment for industry related services and supplies through GeneBTC
  • Establish advanced labs for human genome data analysis via GeneLab
  • Organize and unite global platform for health, entertainment, social network and etc. through GeneNetwork

Continue reading “Wow such genetics. So data. Very forever?” – An overview of the blockchain genomics trend

Coconut: Threshold Issuance Selective Disclosure Credentials with Applications to Distributed Ledgers

Selective disclosure credentials allow the issuance of a credential to a user, and the subsequent unlinkable revelation (or ‘showing’) of some of the attributes it encodes to a verifier for the purposes of authentication, authorisation or to implement electronic cash. While a number of schemes have been proposed, these have limitations, particularly when it comes to issuing fully functional selective disclosure credentials without sacrificing desirable distributed trust assumptions. Some entrust a single issuer with the credential signature key, allowing a malicious issuer to forge any credential or electronic coin. Other schemes do not provide the necessary re-randomisation or blind issuing properties necessary to implement modern selective disclosure credentials. No existing scheme provides all of threshold distributed issuance, private attributes, re-randomisation, and unlinkable multi-show selective disclosure.

We address these challenges in our new work Coconut – a novel scheme that supports distributed threshold issuance, public and private attributes, re-randomization, and multiple unlinkable selective attribute revelations. Coconut allows a subset of decentralised mutually distrustful authorities to jointly issue credentials, on public or private attributes. These credentials cannot be forged by users, or any small subset of potentially corrupt authorities. Credentials can be re-randomised before selected attributes being shown to a verifier, protecting privacy even in the case all authorities and verifiers collude.

Applications to Smart Contracts

The lack of full-featured selective disclosure credentials impacts platforms that support ‘smart contracts’, such as Ethereum, Hyperledger and Chainspace. They all share the limitation that verifiable smart contracts may only perform operations recorded on a public blockchain. Moreover, the security models of these systems generally assume that integrity should hold in the presence of a threshold number of dishonest or faulty nodes (Byzantine fault tolerance). It is desirable for similar assumptions to hold for multiple credential issuers (threshold aggregability). Issuing credentials through smart contracts would be very useful. A smart contract could conditionally issue user credentials depending on the state of the blockchain, or attest some claim about a user operating through the contract—such as their identity, attributes, or even the balance of their wallet.

As Coconut is based on a threshold issuance signature scheme, that allows partial claims to be aggregated into a single credential,  it allows collections of authorities in charge of maintaining a blockchain, or a side chain based on a federated peg, to jointly issue selective disclosure credentials.

System Overview

Coconut is a fully featured selective disclosure credential system, supporting threshold credential issuance of public and private attributes, re-randomisation of credentials to support multiple unlikable revelations, and the ability to selectively disclose a subset of attributes. It is embedded into a smart contract library, that can be called from other contracts to issue credentials. The Coconut architecture is illustrated below. Any Coconut user may send a Coconut request command to a set of Coconut signing authorities; this command specifies a set of public or encrypted private attributes to be certified into the credential (1). Then, each authority answers with an issue command delivering a partial credentials (2). Any user can collect a threshold number of shares, aggregate them to form a consolidated credential, and re-randomise it (3). The use of the credential for authentication is however restricted to a user who knows the private attributes embedded in the credential—such as a private key. The user who owns the credentials can then execute the show protocol to selectively disclose attributes or statements about them (4). The showing protocol is publicly verifiable, and may be publicly recorded.

 

Implementation

We use Coconut to implement a generic smart contract library for Chainspace and one for Ethereum, performing public and private attribute issuing, aggregation, randomisation and selective disclosure. We evaluate their performance, and cost within those platforms. In addition, we design three applications using the Coconut contract library: a coin tumbler providing payment anonymity, a privacy preserving electronic petitions, and a proxy distribution system for a censorship resistance system. We implement and evaluate the first two former ones on the Chainspace platform, and provide a security and performance evaluation. We have released the Coconut white-paper, and the code is available as an open-source project on Github.

Performance

Coconut uses short and computationally efficient credentials, and efficient revelation of selected attributes and verification protocols. Each partial credentials and the consolidated credential is composed of exactly two group elements. The size of the credential remains constant, and the attribute showing and verification are O(1) in terms of both cryptographic computations and communication of cryptographic material – irrespective of the number of attributes or authorities/issuers. Our evaluation of the Coconut primitives shows very promising results. Verification takes about 10ms, while signing an attribute is 15 times faster. The latency is about 600 ms when the client aggregates partial credentials from 10 authorities distributed across the world.

Summary

Existing selective credential disclosure schemes do not provide the full set of desired properties needed to issue fully functional selective disclosure credentials without sacrificing desirable distributed trust assumptions. To fill this gap, we presented Coconut which enables selective disclosure credentials – an important privacy enhancing technology – to be embedded into modern transparent computation platforms. The paper includes an overview of the Coconut system, and the cryptographic primitives underlying Coconut; an implementation and evaluation of Coconut as a smart contract library in Chainspace and Ethereum, a sharded and a permissionless blockchain respectively; and three diverse and important application to anonymous payments, petitions and censorship resistance.

 

We have released the Coconut white-paper, and the code is available as an open-source project on GitHub.  We would be happy to receive your feedback, thoughts, and suggestions about Coconut via comments on this blog post.

The Coconut project is developed, and funded, in the context of the EU H2020 Decode project, the EPSRC Glass Houses project and the Alan Turing Institute.

A Critical Analysis of Genome Privacy Research

The relationship between genomics and privacy-enhancing technologies (PETs) has been an intense one for the better part of the last decade. Ever since Wang et al.’s paper, “Learning your identity and disease from research papers: Information leaks in genome wide association study”, received the PET Award in 2011, more and more research papers have appeared in leading conferences and journals. In fact, a new research community has steadily grown over the past few years, also thanks to several events, such as Dagstuhl Seminars, the iDash competition series, or the annual GenoPri workshop. As of December 2017, the community website genomeprivacy.org lists more than 200 scientific publications, and dozens of research groups and companies working on this topic.

dagstuhl
Participants of the 2015 Dagstuhl Seminar on Genome Privacy

Progress vs Privacy

The rise of genome privacy research does not come as a surprise to many. On the one hand, genomics has made tremendous progress over the past few years. Sequencing costs have dropped from millions of dollars to less than a thousand, which means that it will soon be possible to easily digitize the full genetic makeup of an individual and run complex genetic tests via computer algorithms. Also, researchers have been able to link more and more genetic features to predisposition of diseases (e.g., Alzheimer’s or diabetes), or to cure patients with rare genetic disorders. Overall, this progress is bringing us closer to a new era of “Precision Medicine”, where diagnosis and treatment can be tailored to individuals based on their genome and thus become cheaper and more effective. Ambitious initiatives, including in UK and in US, are already taking place with the goal of sequencing the genomes of millions of individuals in order to create bio-repositories and make them available for research purposes. At the same time, a private sector for direct-to-consumer genetic testing services is booming, with companies like 23andMe and AncestryDNA already having millions of customers.

23andme
Example of 23andme “Health Overview” test results. (Image from: https://www.singularityweblog.com/)

On the other hand, however, the very same progress also prompts serious ethical and privacy concerns. Genomic data contains highly sensitive information, such as predisposition to mental and physical diseases, as well as ethnic heritage. And it does not only contain information about the individual, but also about their relatives. Since many biological features are hereditary, access to genomic data of an individual essentially means access to that of close relatives as well. Moreover, genomic data is hard to anonymize: for instance, well-known results have demonstrated the feasibility of identifying people (down to their last name) who have participated in genetic research studies just by cross-referencing their genomic information with publicly available data.

Overall, there are a couple of privacy issues that are specific to genomic data, for instance its almost perpetual sensitivity. If someone gets ahold of your genome 30 years from now, that might be still as sensitive as today, e.g., for your children. Even if there may be no immediate risks from genomic data disclosure, things might change. New correlations between genetic features and phenotypical traits might be discovered, with potential effects on perceived suitability to certain jobs or on health insurance premiums. Or, in a nightmare scenario, racist and discriminatory ideologies might become more prominent and target certain groups of people based on their genetic ancestry.

Alt-right trolls are arguing over genetic tests they think “prove” their whiteness. (Image taken from Vice News)

Making Sense of PETs for Genome Privacy

Motivated by the need to reconcile privacy protection with progress in genomics, the research community has begun to experiment with the use of PETs for securely testing and studying the human genome. In our recent paper, Systematizing Genomic Privacy Research – A Critical Analysis, we take a step back. We set to evaluate research results using PETs in the context of genomics, introducing and executing a methodology to systematize work in the field, ultimately aiming to elicit the challenges and the obstacles that might hinder their real-life deployment.

Continue reading A Critical Analysis of Genome Privacy Research

Chainspace: A Sharded Smart Contracts Platform

Thanks to their resilience, integrity, and transparency properties, blockchains have gained much traction recently, with applications ranging from banking and energy sector to legal contracts and healthcare. Blockchains initially received attention as Bitcoin’s underlying technology. But for all its success as a popular cryptocurrency, Bitcoin suffers from scalability issues: with a current block size of 1MB and 10 minute inter-block interval, its throughput is capped at about 7 transactions per second, and a client that creates a transaction has to wait for about 10 minutes to confirm that it has been added to the blockchain. This is several orders of magnitude slower that what mainstream payment processing companies like Visa currently offer: transactions are confirmed within a few seconds, and have ahigh throughput of 2,000 transactions per second on average, peaking up to 56,000 transactions per second. A reparametrization of Bitcoin can somewhat assuage these issues, increasing throughput to to 27 transactions per second and 12 second latency. Smart contract platforms, such as Ethereum inherit those scalability limitations. More significant improvements, however, call for a fundamental redesign of the blockchain paradigm.

This week we published a pre-print of our new Chainspace system—a distributed ledger platform for high-integrity and transparent processing of transactions within a decentralized system. Chainspace uses smart contracts to offer extensibility, rather than catering to specific applications such as Bitcoin for a currency, or certificate transparency for certificate verification. Unlike Ethereum, Chainspace’s sharded architecture allows for a ledger linearly scalable since only the nodes concerned with the transaction have to process it. Our modest testbed of 60 cores achieves 350 transactions per second. In comparison, Bitcoin achieves a peak rate of less than 7 transactions per second for over 6k full nodes, and Ethereum currently processes 4 transactions per second (of a theoretical maximum of 25). Moreover, Chainspace is agnostic to the smart contract language, or identity infrastructure, and supports privacy features through modern zero-knowledge techniques. We have released the Chainspace whitepaper, and the code is available as an open-source project on GitHub.

System Overview

The figure above illustrates the system design of Chainspace. Chainspace is comprised of a network of infrastructure nodes that manage valid objects and ensure that only valid transactions on those objects are committed.  Let’s look at the data model of Chainspace first. An object represents a unit of data in the Chainspace system (e.g., a bank account), and is in one of the following three states: active (can be used by a transaction), locked (is being processed by an existing transaction), or inactive (was used by a previous transaction).  Objects also have a type that determines the unique identifier of the smart contract that defines them. Smart contract procedures can operate on active objects only, while inactive objects are retained just for the purposes of audit. Chainspace allows composition of smart contracts from different authors to provide ecosystem features. Each smart contract is associated with a checker to enable private processing of transactions on infrastructure nodes since checkers do not take any secret local parameters. Checkers are pure functions (i.e., deterministic, and have no side-effects) that return a boolean value.

Now, a valid transaction accepts active input objects along with other ancillary information, and generates output objects (e.g., transfers money to another bank account). To achieve high transaction throughput and low latency, Chainspace organizes nodes into shards that manage the state of objects, keep track of their validity, and record transactions aborted or committed. We implemented this using Sharded Byzantine Atomic Commit (S-BAC)—a protocol that composes existing Byzantine Fault Tolerant (BFT) agreement and atomic commit primitives in a novel way. Here is how the protocol works:

  • Intra-shard agreement. Within each shard, all honest nodes ensure that they consistently agree on accepting or rejecting a transaction.
  • Inter-shard agreement. Across shards, nodes must ensure that transactions are committed if all shards are willing to commit the transaction, and rejected (or aborted) if any shards decide to abort the transaction.

Consensus on committing (or aborting) transactions takes place in parallel across different shards. A nice property of S-BAC’s atomic commit protocol is that the entire shard—rather than a third party—acts as a coordinator. This is in contrast to other sharding-based systems with cryptocurrency application like OmniLedger or RSCoin where an untrusted client acts as the coordinator, and is incentivized to act honestly. Such incentives do not hold for a generalized platform like Chainspace where objects may have shared ownership.

Continue reading Chainspace: A Sharded Smart Contracts Platform

Creating scalable distributed ledgers for DECODE

Since the introduction of Bitcoin in 2008, blockchains have gone from a niche cryptographic novelty to a household name. Ethereum expanded the applicability of such technologies, beyond managing monetary value, to general computing with smart contracts. However, we have so far only scratched the surface of what can be done with such “Distributed Ledgers”.

The EU Horizon 2020 DECODE project aims to expand those technologies to support local economy initiatives, direct democracy, and decentralization of services, such as social networking, sharing economy, and discursive and participatory platforms. Today, these tend to be highly centralized in their architecture.

There is a fundamental contradiction between how modern services harness the work and resources of millions of users, and how they are technically implemented. The promise of the sharing economy is to coordinate people who want to provide resources with people who want to use them, for instance spare rooms in the case of Airbnb; rides in the case of Uber; spare couches of in the case of couchsurfing; and social interactions in the case of Facebook.

These services appear to be provided in a peer-to-peer, and disintermediated fashion. And, to some extent, they are less mediated at the application level thanks to their online nature. However, the technical underpinnings of those services are based on the extreme opposite design philosophy: all users technically mediate their interactions through a very centralized service, hosted on private data centres. The big internet service companies leverage their centralized position to extract value out of user or providers of services – becoming de facto monopolies in many case.

When it comes to privacy and security properties, those centralized services force users to trust them absolutely, and offer little on the way of transparency to even allow users to monitor the service practices to ground that trust. A recent example illustrating this problem was Uber, the ride sharing service, providing a different view to drivers and riders about the fare that was being paid for a ride – forcing drivers to compare what they receive with what riders pay to ensure they were getting a fair deal. Since Uber, like many other services, operate in a non-transparent manner, its functioning depends on users absolute to ensure fairness.

The lack of user control and transparency of modern online services goes beyond monetary and economic concerns. Recently, the Guardian has published the guidelines used by Facebook to moderate abusive or illegal user postings. While, moderation has a necessary social function, the exact boundaries of what constitutes abuse came into question: some forms of harms to children or holocaust denial were ignored, while material of artistic or political value has been suppressed.

Even more worryingly, the opaque algorithms being used to promote and propagate posts have been associated with creating a filter bubble effect, influencing elections, and dark adverts, only visible to particular users, are able to flout standards of fair political advertising. It is a fact of the 21st century that a key facet of the discursive process of democracy will take place on online social platforms. However, their centralized, opaque and advertising-driven form is incompatible with their function as a tool for democracy.

Finally, the revelations of Edward Snowden relating to mass surveillance, also illustrate how the technical centralization of services erodes privacy at an unprecedented scale. The NSA PRISM program coerced internet services to provide access to data on their services under a FISA warrant, not protecting the civil liberties of non-American persons. At the same time, the UPSTREAM program collected bulk information between data centres making all economic, social and political activities taking place on those services transparent to US authorities. While users struggle to understand how those services operate, governments (often foreign) have total visibility. This is a complete inversion of the principles of liberal democracy, where usually we would expect citizens to have their privacy protected, while those in position of authority and power are expected to be accountable.

The problems of accountability, transparency and privacy are social, but are also based on the fundamental centralized architecture underpinning those services. To address them, the DECODE project brings together technical, legal, social experts from academia, alongside partners from local government and industry. Together they are tasked to develop architectures that are compatible with the social values of transparency, user and community control, and privacy.

The role of UCL Computer Science, as a partner, is to provide technical options into two key technical areas: (1) the scalability of secure decentralized distributed ledgers that can support millions or billions of users while providing high-integrity and transparency to operations; (2) mechanisms for protecting user privacy despite the decentralized and transparent infrastructure. The latter may seem like an oxymoron: how can transparency and privacy be reconciled? However, thanks to advances in modern cryptography, it is possible to ensure that operations were correctly performed on a ledger, without divulging private user data – a family of techniques known as zero-knowledge.

I am particularly proud of the UCL team we have put together that is associated with this project, and strengthens considerably our existing expertise in distributed ledgers.

I will be leading and coordinating the work. I have a long standing interest, and track record, in privacy enhancing technologies and peer-to-peer computing, as well as scalable distributed ledgers – such as the RSCoin currency proposal. Shehar Bano, an expert on systems and networking, has joined us as a post-doctoral researcher after completing her thesis at Cambridge. Alberto Sonnino will be doing his thesis on distributed ledgers and privacy, as well as hardware and IoT applications related to ledgers, after completing his MSc in Information Security at UCL last year. Mustafa Al-Bassam, is also associated with the project and works on high-integrity and scalable ledger technologies, after completing his degree at Kings College London – he is funded by the Turing Institute to work on such technologies. Those join our wider team of UCL CS faculty, with research interests in distributed ledgers, including Sarah Meiklejohn, Nicolas Courtois and Tomaso Aste and their respective teams.

 

This post also appears on the DECODE project blog.

Top ten obstacles along distributed ledgers’ path to adoption

In January 2009, Bitcoin was released into the world by its pseudonymous founder, Satoshi Nakamoto. In the ensuing years, this cryptocurrency and its underlying technology, called the blockchain, have gone on a rollercoaster ride that few could have predicted at the time of its deployment. It’s been praised by governments around the world, and people have predicted that “the blockchain” will one day be like “the Internet.” It’s been banned by governments around the world, and people have declared it “adrift” and “dead.”

After years in which discussions focused entirely on Bitcoin, people began to realize the more abstract potential of the blockchain, and “next-generation” platforms such as Ethereum, Steem, and Zcash were launched. More established companies also realized the value in the more abstract properties of the blockchain — resilience, integrity, etc. — and repurposed it for their particular industries to create an even wider class of technologies called distributed ledgers, and to form industrial consortia such as R3 and Hyperledger. These more general distributed ledgers can look, to varying degrees, quite unlike blockchains, and have a somewhat clearer (or at least different) path to adoption given their association with established partners in industry.

Amidst many unknowns, what is increasingly clear is that, even if they might not end up quite like “the Internet,” distributed ledgers — in one form or another — are here to stay. Nevertheless, a long path remains from where we are now to widespread adoption and there are many important decisions to be made that will affect the security and usability of any final product. In what follows, we present the top ten obstacles along this path, and highlight in some cases both the problem and what we as a community can do (and have been doing) to address them. By necessity, many interesting aspects of distributed ledgers, both in terms of problems and solutions, have been omitted, and the focus is largely technical in nature.

10. Usability: why use distributed ledgers?

The problem, in short. What do end users actually want from distributed ledgers, if anything? In other words, distributed ledgers are being discussed as the solution to problems in many industries, but what is it that the full public verifiability (or accountability, immutability, etc.) of distributed ledgers really maps to in terms of what end users want?

9. Governance: who makes the rules?

The problem, in short. The beauty of distributed ledgers is that no one entity gets to control the decisions made by the network; in Bitcoin, e.g., coins are generated or transferred from one party to another only if a majority of the peers in the network agree on the validity of this action. While this process becomes threatened if any one peer becomes too powerful, there is a larger question looming over the operation of these decentralized networks: who gets to decide which actions are valid in the first place? The truth is that all these networks operate according to a defined set of rules, and that “who makes the rules matters at least as much as who enforces them.”

In this process of making the rules, even the most decentralized networks turn out to be heavily centralized, as recent issues in cryptocurrency governance demonstrate. These increasingly common collapses threaten to harm the value of these cryptocurrencies, and reveal the issues associated with ad-hoc forms of governance. Thus, the problem is not just that we don’t know how to govern these technologies, but that — somewhat ironically — we need more transparency around how these structures operate and who is responsible for which aspects of governance.

8. Meaningful comparisons: which is better?

The problem. Bitcoin was the first cryptocurrency to be based on the architecture we now refer to as the blockchain, but it certainly isn’t the last; there are now thousands of alternative cryptocurrencies out there, each with its own unique selling point. Ethereum offers a more expressive scripting language and maintains state, Litecoin allows for faster block creation than Bitcoin, and each new ICO (Initial Coin Offering) promises a shiny feature of its own. Looking beyond blockchains, there are numerous proposals for cryptocurrencies based on consensus protocols other than proof-of-work and proposals in non-currency-related settings, such as Certificate Transparency, R3 Corda, and Hyperledger Fabric, that still fit under the broad umbrella of distributed ledgers.

Continue reading Top ten obstacles along distributed ledgers’ path to adoption

Caveat emptor: Privacy could turn UK’s genomic dream into a nightmare

Raise your hand if, over the past couple of years, you have not heard of whole genome sequencing (usually abbreviated as WGS), or at least read a sensational headline or two about how fast its costs are dropping. In a nutshell, WGS is used to determine an organism’s complete DNA sequence. But it is actually not the only way to analyze our DNA — in fact, genetic testing has been used in clinical settings for decades, e.g., to diagnose patients with known genetic conditions. Seven-time Wimbledon champion Pete Sampras is a beta-thalassemia carrier – a condition that affects the formation of beta-globin chains, ultimately leading to red blood cells not being formed correctly. Testing for thalassemia, usually triggered by family history or a blood test showing low mean corpuscular volume, is done with a number of simple in-vitro techniques.

The availability of affordable whole genome sequencing not only prompts new hopes toward the discovery and diagnosis of rare/unknown genetic conditions, but also enables researchers to better understand the relationship between the genome and predisposition to diseases, response to treatment, etc. Overall, progress makes it increasingly feasible to envision a not-so-distant future where individuals will undergo sequencing once, making their digitized genome easily available for doctors, clinicians, and third-parties. This would also allow us to use computational algorithms to analyze the genome as a whole, as opposed to expensive, slower, targeted in-vitro tests.

Along these lines is last week’s announcement by Prof. Dame Sally Davies, UK’s Chief Medical Officer, calling the NHS to deliver her “genomic dream” within five years, with whole genome sequencing becoming “as standard as blood tests and biopsies.” As detailed in her annual report, a large number of patients in the UK already undergo genetic testing at least once in their life, and for a wide range of reasons, including the aforementioned thalassemia diagnosis, screening for cancer predisposition triggered by high family incidence, or determining the best course of action in cancer treatment. So wouldn’t it make sense to sequence the genome once and keep the data available for life? My answer is yes, but with a number of bold and double underlined caveats.

The first one is with respect to the security concerns prompted by the need to store data of extreme sensitivity like genomic data. The genome obviously contains information about ethnic heritage and predisposition to diseases/conditions, possibly including mental disorders. Data breaches of sensitive information, including health and medical data, sadly happen on a daily basis. But certain security threats are actually specific to genomic data and much more worrisome. For instance, due to its hereditary nature, access to a genome essentially implies access to that of close relatives as well, including offspring, so one’s decision to publish/donate their genome is also being made for their siblings, kids/grandkids, etc. So sensitivity does not degrade over time, but persists long after a patient’s death. In fact, it might even increase, as new aspects of the genome are studied and discovered. As a consequence, Prof. Dame Davies’ dream could easily turn into a nightmare without adequate investments toward sound security measures, that involve both technical tools (such as upgrading of obsolete hardware) as well as education, awareness, and practices that do not simply shift burden onto clinicians and practitioners, but incorporate security in their design and not as an after-the-fact.

Another concern is with allowing researchers to use the genomic data collected by the NHS, along with medical history, for research purposes – e.g., to discover genetic mutations that are responsible for certain traits or diseases. This requires building a meaningful trust relationship between the NHS/Government and patients, which cannot happen without healing the wounds from recent incidents like the care.data debacle or Google DeepMind’s use of personal NHS records. Instead, the annual report seems to include security/anonymity promises we cannot realistically maintain, while, worse yet, promoting a rhetoric of greater good trumping privacy concerns, as well as seemingly pushing a choice between donating data and access to the best care. It is misleading to use terms like “de-identification” of genomic data as an effective protection tool, while proper anonymization is inherently impossible due to its peculiar combination of unique and hereditary features, as demonstrated by a wide array of scientific results. Rather, we should make it clear that data can never be fully anonymized, or protected with 100% guarantees.

Overall, I believe that patients should not be automatically enrolled in sequencing programs. Even if they are given an option to later withdraw, once the data is out there it is impossible to delete all copies of it. Rather, patients should voluntarily decide to join through an effective informed consent mechanism. This proves to be challenging against a background in which information that can be extracted/inferred from genomes may rapidly change: what if in the future a new mutation responsible for early on-set Alzheimer’s is discovered? What if the NHS is privatized? Encouraging results with respect to education and informed consent, however, do exist. For instance, the Personal Genome Project is a good example of effective strategies to help volunteers understand the risks and could be used to inform future NHS-run sequencing programs.

 

An edited version of this article was originally published on the BMJ.