2022 USENIX Conference on Privacy Engineering Practice and Respect: Key Takeaways

Sep 12, 2022
The 2022 USENIX Conference on Privacy Engineering Practice and Respect (PEPR '22) focused on designing and building products and systems that respect their users' privacy and the societies in which they operate.
The talks at PEPR '22 were an encouraging sign that a new "privacy protection industry" is emerging, with new privacy technologies being developed and discoveries being made to give users more control over, and transparency into, their personal data. Such advancements reflect a step toward making privacy a "human-value-focused" need rather than a "compliance-only" mandate.
Some of the privacy topics covered at the conference included differential privacy, privacy threat modeling, consent management, effective privacy labeling and standardization, privacy incident management, privacy in different areas (e.g., smart home devices, browsers, AI chatbots), privacy for vulnerable groups (e.g., children and non-binary people), and privacy by design.
Here are the key takeaways from the conference.

Differential Privacy

In Practice

Differential privacy (DP) is an increasingly popular tool for preserving individuals' privacy by adding statistical uncertainty when sharing sensitive data. However, DP comes with practical challenges, including the need for iterative exploration and negotiation with data custodians and analysts over the choice of privacy parameters (e.g., epsilon and delta), DP variants (different algorithms/mechanisms, e.g., the Gaussian and Laplace mechanisms), and the statistics to release (e.g., mean, variance, and sum).
Further, there is tension between data minimization (limiting data collection) and differential privacy: DP gives reasonably accurate results only on large datasets, yet collecting large amounts of data can infringe on individuals' privacy and transgress regulations.
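To make the parameter choices concrete, here is a minimal sketch (mine, not from any talk) of releasing a differentially private sum with the Laplace mechanism; the dataset, clipping bounds, and epsilon values are illustrative assumptions.

```python
import numpy as np

def dp_sum(values, lower, upper, epsilon, rng=np.random.default_rng()):
    """Release a differentially private sum using the Laplace mechanism.

    Each individual's value is clipped to [lower, upper], so the L1
    sensitivity of the sum is max(|lower|, |upper|); Laplace noise with
    scale sensitivity / epsilon is then added to the true sum.
    """
    clipped = np.clip(values, lower, upper)
    sensitivity = max(abs(lower), abs(upper))
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.sum() + noise

# Hypothetical per-user purchase amounts.
purchases = np.array([12.0, 7.5, 30.0, 3.2, 18.9])

# Smaller epsilon => more noise => stronger privacy; choosing epsilon is
# exactly the kind of negotiation described above.
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, round(dp_sum(purchases, 0.0, 50.0, epsilon), 2))
```

The noise scale depends only on the sensitivity and epsilon, not on the dataset size, so on small datasets the noise dominates the true sum, which is the data-minimization tension described above.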

Contextual Integrity

Contextual Integrity (CI) [1] was introduced as a framework to help with these decisions about applying DP. CI captures socially nuanced requirements of privacy. To capture such nuances, it provides a template for understanding information norms in a social context (e.g., education or healthcare), parameterized in terms of the sender, the receiver, the data subject of the personal information, the sensitive attributes of the data subject, and the transmission principle (the normative rule governing the conditions under which this parameterized information flow is appropriate or inappropriate).
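As a concrete illustration (my own sketch, not from the talk), a CI information flow can be captured as a small data structure whose fields mirror the template's parameters; the healthcare norm in the example is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class InformationFlow:
    """One contextual-integrity flow: who sends what, about whom, to whom, under what rule."""
    sender: str
    receiver: str
    data_subject: str
    attribute: str
    transmission_principle: str

# Hypothetical healthcare norm: a lab may share results with the treating
# physician, but only for the purpose of treatment and with consent.
flow = InformationFlow(
    sender="diagnostic lab",
    receiver="treating physician",
    data_subject="patient",
    attribute="blood test results",
    transmission_principle="shared only for treatment, with patient consent",
)
print(flow)
```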

Libraries

A large number of DP libraries have been developed and open-sourced with the primary goal of assisting non-experts in applying DP. However, there is a dearth of libraries focusing on scalability. The two libraries introduced, Tumult Analytics/Core by Tumult Labs [2] and PipelineDP [3] by Google and OpenMined, address this problem. The former is built on the framework of the OpenDP library [4] by Harvard's OpenDP team and uses Apache Spark for scalability. The latter employs Apache Spark and Apache Beam to process massive datasets.
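Under the hood, much of what these libraries automate (beyond distributed execution) is per-user contribution bounding plus calibrated noise. Below is a minimal, library-free sketch of a differentially private per-category count; the data, contribution bound, and epsilon are illustrative assumptions, not the API of either library.

```python
from collections import Counter, defaultdict

import numpy as np

def dp_category_counts(rows, max_categories_per_user, epsilon, rng=np.random.default_rng(0)):
    """Differentially private count of users per category.

    rows: iterable of (user_id, category) pairs. Each user contributes to at
    most `max_categories_per_user` categories, which bounds the L1 sensitivity
    of the counts; Laplace noise with scale bound / epsilon is then added.
    """
    per_user = defaultdict(set)
    for user_id, category in rows:
        per_user[user_id].add(category)

    counts = Counter()
    for categories in per_user.values():
        for category in sorted(categories)[:max_categories_per_user]:  # contribution bounding
            counts[category] += 1

    scale = max_categories_per_user / epsilon
    return {category: count + rng.laplace(0.0, scale) for category, count in counts.items()}

# Hypothetical rows: which device categories each user owns.
rows = [("u1", "camera"), ("u1", "speaker"), ("u2", "camera"), ("u3", "thermostat")]
print(dp_category_counts(rows, max_categories_per_user=1, epsilon=1.0))
```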

Gender Privacy

Applications asking for users' gender predominantly follow the norm of offering binary options. Technologists must recognize the complexity and dynamic nature of gender, as well as disclosure concerns, and must apply purpose limitation (do you really need to collect gender at all?). These realizations can contribute to technological advancements that respect diverse social groups. Furthermore, the usage of gendered honorifics, salutations, and titles is an implicit manner of collecting gender (credit: an attendee sharing), which further underscores the need to consider the cultural aspects of gender collection.
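As a hypothetical illustration (mine, not from the talk) of purpose limitation and more inclusive gender collection: make the field optional, include self-described and "prefer not to say" options, and record why it is collected at all.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Gender(Enum):
    WOMAN = "woman"
    MAN = "man"
    NON_BINARY = "non-binary"
    SELF_DESCRIBED = "self-described"
    PREFER_NOT_TO_SAY = "prefer not to say"

@dataclass
class GenderField:
    """Optional gender field with an explicit, documented purpose."""
    purpose: str                             # if no purpose can be stated, don't collect it
    value: Optional[Gender] = None           # unanswered by default
    self_description: Optional[str] = None   # free text when SELF_DESCRIBED is chosen

# Hypothetical usage: collected only for aggregate, optional demographics.
profile_gender = GenderField(purpose="optional demographic statistics, aggregated only")
profile_gender.value = Gender.SELF_DESCRIBED
profile_gender.self_description = "genderfluid"
print(profile_gender)
```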

Privacy in AI Chatbots

Disconcerting empirical findings revealed that chatbot users share highly sensitive information such as political opinions, sexual orientation, religious beliefs, and health data. Some promising directions for developers were presented to address the potential privacy harms to users. These included identifying mental health concerns and vulnerable populations, adjusting the interaction mode or interventions accordingly, safeguarding against inadvertent sharing of sensitive information, and avoiding manipulative/dark patterns, i.e., design techniques that nudge users toward less privacy-protective options.
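One way to read "safeguarding against inadvertent sharing" is a pre-storage filter that flags likely sensitive disclosures before they are logged or reused. The sketch below is my own illustration; the keyword lists are deliberately incomplete assumptions, and a production system would need far more than keyword matching.

```python
import re

# Illustrative (and deliberately incomplete) cues for sensitive categories.
SENSITIVE_PATTERNS = {
    "health": re.compile(r"\b(diagnos\w*|depress\w*|medication|therapist)\b", re.I),
    "religion": re.compile(r"\b(church|mosque|synagogue|atheist)\b", re.I),
    "politics": re.compile(r"\b(vote[ds]?|election|party member)\b", re.I),
}

def flag_sensitive(message: str) -> list[str]:
    """Return the sensitive categories a user message appears to touch."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(message)]

msg = "My therapist changed my medication last week."
categories = flag_sensitive(msg)
if categories:
    # e.g., redact before logging, or switch to a gentler interaction mode.
    print("Sensitive categories detected:", categories)
```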
These excerpts from Cleverbot's [5] privacy policy give me chills (credit: an attendee sharing):
“It also means that content is by default not deleted from our servers - ever - and may still be accessible after your account is deleted.”
"by sending input to Cleverbot you agree that we may process data for or about you"

Browser Privacy

DNS Privacy

The Domain Name System (DNS) is the Internet's directory: it resolves human-readable domain names to IP addresses. Anyone on the network path between one's device and the DNS resolver can see both the query containing the desired hostname (or website) and the IP address identifying one's device. This browsing data has been monetized (for example, by ISPs) for purposes such as ad targeting.
DNS over TLS (DoT) and DNS over HTTPS (DoH) are often used to encrypt DNS queries so they cannot be read in transit. The key distinction between the two is the port used: DoT exclusively uses port 853, whereas DoH uses port 443, which is also used by all other HTTPS traffic. DoH is generally considered better for privacy: DoH queries blend into regular HTTPS traffic, giving network administrators less visibility and users more privacy, whereas DoT traffic, running on a dedicated port, is easy for network administrators to identify, monitor, and block.
The two techniques have two major drawbacks: centralized DNS presents a single point of failure, and the resolver can still link all queries to client IP addresses. Oblivious DNS over HTTPS (ODoH) addresses these issues by adding a layer of public-key encryption as well as a network proxy between clients and DoH servers. The combination of these two elements ensures that no single party other than the user sees both the DNS messages and the client's IP address.
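For a concrete feel of DoH from the client side, here is a minimal sketch that resolves a name through Cloudflare's public DoH endpoint using its JSON API; the endpoint and response fields follow the publicly documented interface, but treat the exact details as assumptions to verify.

```python
import json
import urllib.parse
import urllib.request

def doh_resolve(hostname: str, record_type: str = "A") -> list[str]:
    """Resolve a hostname over DNS-over-HTTPS using Cloudflare's JSON API.

    The query travels inside ordinary HTTPS traffic on port 443, so an
    on-path observer sees a TLS connection to the resolver, not the name
    being looked up.
    """
    params = urllib.parse.urlencode({"name": hostname, "type": record_type})
    request = urllib.request.Request(
        f"https://cloudflare-dns.com/dns-query?{params}",
        headers={"Accept": "application/dns-json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return [answer["data"] for answer in payload.get("Answer", [])]

print(doh_resolve("example.com"))
```

ODoH would additionally encrypt the query to the target resolver and relay it through a proxy, so that the resolver never learns the client's IP address.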
It is important to acknowledge that DNS privacy, through these protocols, can cut both ways for the public interest, affecting, for example, abuse detection, censorship of nefarious content, internet shutdowns, and accessibility by third parties or user agents [6].

Content Blocking (a.k.a. adblocking or tracker blocking)

37% of web users use ad blockers, valuing choice and control over their browsing experience. Community-maintained, pre-defined filter lists (collections of rules that describe what to block on which websites, e.g., EasyList and EasyPrivacy) can become obsolete and cause functionality breakage across the web. An alternative technique, "sugarcoating", instead replaces the original tracking script with a privacy-preserving one, mitigating the aforementioned problems.
SugarCoat: Programmatically generating privacy-preserving, web-compatible resource replacements for content blocking [7].

Smart Home Privacy

IoT Inspector

Smart home devices (internet-connected consumer devices) come with well-known challenges, including privacy (e.g., sensitive information transmitted to third parties), security (e.g., distributed denial-of-service attacks), and device inventory and management (e.g., determining what devices are connected to a network).
To address these challenges, IoT Inspector [8], an open-source IoT traffic monitoring tool for smart homes (working via ARP spoofing), was introduced. It helps smart home users visualize the network activities of their smart home devices. In addition, the large-scale crowdsourced smart home network traffic data and device labels collected with IoT Inspector (63,000+ devices across 6,400+ users) can enable data-driven smart home research.
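IoT Inspector intercepts live traffic via ARP spoofing; to get a flavor of the "see what your devices talk to" idea without spoofing anything, the sketch below (mine, using scapy and a hypothetical capture file) summarizes DNS queries per device from an existing packet capture.

```python
from collections import Counter

from scapy.all import DNSQR, Ether, rdpcap  # pip install scapy

def dns_queries_per_device(pcap_path: str) -> dict[str, Counter]:
    """Map each source MAC address to a counter of the domains it looked up."""
    queries: dict[str, Counter] = {}
    for packet in rdpcap(pcap_path):
        if packet.haslayer(Ether) and packet.haslayer(DNSQR):
            mac = packet[Ether].src
            domain = packet[DNSQR].qname.decode(errors="replace").rstrip(".")
            queries.setdefault(mac, Counter())[domain] += 1
    return queries

# Hypothetical capture taken on the home network.
for mac, domains in dns_queries_per_device("smart_home.pcap").items():
    print(mac, domains.most_common(5))
```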

Bystanders' Privacy

Smart home users are defined as those who own smart home devices in their home, and smart home bystanders as those who do not own smart home devices but may be subject to data collection by such devices.
The question is how to deliver privacy-related notifications to users and bystanders and increase their awareness of data practices. To address this, four privacy awareness mechanisms were investigated [9] to learn users' and bystanders' preferences: a privacy dashboard, a data app, an ambient light (visual cues that signal network traffic, e.g., green for encrypted traffic, red for tracking traffic, and yellow for regular traffic), and a privacy speaker (audio cues that signal network traffic, e.g., a pleasant chime for encrypted traffic, a scary buzz for tracking traffic, and a single beep for regular traffic). Each mechanism has its own pros and cons in terms of control, detail, ease of use, security concerns, intrusiveness, psychological burden, understandability, annoyance, and so on.
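A toy sketch (mine, not the study's implementation) of the ambient light and privacy speaker mappings described above, with the traffic classes treated as given:

```python
from enum import Enum

class TrafficClass(Enum):
    ENCRYPTED = "encrypted"
    TRACKING = "tracking"
    REGULAR = "regular"

# Cue mappings as described in the study, one visual and one audio per class.
AMBIENT_LIGHT = {
    TrafficClass.ENCRYPTED: "green",
    TrafficClass.TRACKING: "red",
    TrafficClass.REGULAR: "yellow",
}
PRIVACY_SPEAKER = {
    TrafficClass.ENCRYPTED: "pleasant chime",
    TrafficClass.TRACKING: "scary buzz",
    TrafficClass.REGULAR: "single beep",
}

def notify(traffic: TrafficClass) -> None:
    """Emit both cues for an observed traffic event (stubbed as a print)."""
    print(f"light={AMBIENT_LIGHT[traffic]}, sound={PRIVACY_SPEAKER[traffic]}")

notify(TrafficClass.TRACKING)
```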

Privacy Testbed

Privacy is simply too complicated to be left in the hands of the average consumer (including developers). A remedy is to regulate the infrastructure that collects, stores and transfers data. The proposed privacy testbed [10] can analyze and understand the privacy behavior of client-server applications in a network environment across a large number of hosts in a systematic manner. Using the testbed, developers can instantiate multiple virtual devices with various versions of operating systems to facilitate executing privacy-related analyses. Such a testbed can be utilized by regulators for certification and verification, while consumers can validate application claims regarding their data practices.

Privacy By Design

Privacy by design states that privacy (including security) should be proactively incorporated into the software development lifecycle as opposed to being considered an afterthought.
Successful incorporation of privacy by design happens when developers treat privacy as a feature rather than a bottleneck and raise questions: "Wait! This doesn't seem right! We need to discuss it with the team." Another moral point of view to take into account is that, rather than emphasizing the business value of privacy (which is a good starting point to incentivize developers), organizations should focus on the human value, acknowledging that behind every piece of data is a human.
Some of the aspects of privacy by design covered in the talks are described below:

Personal Data Annotation

Personal data entering an organization can spread like wildfire through datasets and storage systems, rendering data lineage untraceable and potentially exposing the organization to privacy and security breaches. It is therefore paramount to annotate the data and help developers recognize the sensitivity level of the data they are handling.
The Twitter team came up with 504 highly granular labels for automated annotation of dataset columns. They designed an Annotation Recommendation Service [11] using Elasticsearch with a neural network calibration model. Its components include:
  1. Data Collection Engine (collects existing annotations from already annotated data and dataset metadata)
  2. Recommendation Engine (suggests annotations using the data collected by Data Collection Engine)
  3. ML Refresh Pipeline (regularly trains and evaluates the performance of the recommendation models based on the new data from the Data Collection Engine)
  4. Annotation Service APIs (help interaction of other services and internal tools with the Annotation Recommendation Service)
  5. Integrations (internal tools integrated with the Annotation Recommendation Service; helps in triaging the false positives and false negatives by including a human element for review)
Automation through machine learning supported the quality and velocity of data annotation. This approach also helped in data discovery (optimizing storage and discovering usage patterns) and data auditing and handling (detecting sensitivity of the accessed data and applying the necessary access control measures).
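The production system pairs Elasticsearch retrieval with a neural-network calibration model; as a purely illustrative stand-in (not Twitter's implementation), the sketch below recommends labels for a new column by token overlap with previously annotated columns, with a score threshold playing the calibration role. All column names and labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedColumn:
    name: str
    label: str  # one of the annotation labels (504 of them in Twitter's case)

# Pretend index of already-annotated columns (the Data Collection Engine's output).
INDEX = [
    AnnotatedColumn("user_email_address", "contact.email"),
    AnnotatedColumn("shipping_email", "contact.email"),
    AnnotatedColumn("device_ip", "network.ip_address"),
    AnnotatedColumn("profile_birth_date", "identity.date_of_birth"),
]

def recommend_labels(column_name: str, threshold: float = 0.3) -> list[tuple[str, float]]:
    """Score candidate labels by token overlap with previously annotated columns."""
    tokens = set(column_name.lower().split("_"))
    scores: dict[str, float] = {}
    for annotated in INDEX:
        overlap = tokens & set(annotated.name.lower().split("_"))
        score = len(overlap) / max(len(tokens), 1)
        scores[annotated.label] = max(scores.get(annotated.label, 0.0), score)
    # The threshold stands in for the calibration model: weak matches are dropped,
    # and borderline cases go to human review (the Integrations component).
    return sorted(((label, s) for label, s in scores.items() if s >= threshold), key=lambda x: -x[1])

print(recommend_labels("billing_email"))
```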

Product Deprecation

Users expect their data to be deleted after a product or feature is deprecated, yet organizations have a tendency to keep a product's dead/unused data and code forever. The Meta team addressed safe, rigorous, and automated deletion of data and code, leading to privacy wins: mitigating the risk of data retention policy violations and supporting the privacy team by reducing the attack surface.
They presented two approaches for automated, multi-language, and scalable data and code deletion that combine static usage measurements (code querying the data), runtime usage (traffic in production), and code-specific knowledge (including the business perspective). Code dependencies across multiple languages are indexed using Glean [12], which presents static analysis information in a readily accessible format. They also built a workflow management tool to help engineers deprecate a product or feature safely, efficiently, and completely. Building such an internal tool not only helps in educating engineers but also in building trust.
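As a toy illustration of the core decision (my own sketch, not Meta's tooling), the snippet below combines static references, runtime access, and business knowledge to decide which assets of a deprecated product look safe to delete automatically, routing everything else to human review. The asset names and fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    static_references: int       # e.g., from a code index such as Glean
    runtime_reads_30d: int       # e.g., from production access logs
    business_hold: bool = False  # code-specific / business knowledge

def deletion_plan(assets: list[Asset]) -> dict[str, list[str]]:
    plan = {"delete": [], "review": []}
    for asset in assets:
        if asset.static_references == 0 and asset.runtime_reads_30d == 0 and not asset.business_hold:
            plan["delete"].append(asset.name)
        else:
            plan["review"].append(asset.name)  # a human (or workflow tool) decides
    return plan

assets = [
    Asset("legacy_feature_events", static_references=0, runtime_reads_30d=0),
    Asset("legacy_feature_config", static_references=2, runtime_reads_30d=0),
    Asset("legacy_audit_log", static_references=0, runtime_reads_30d=0, business_hold=True),
]
print(deletion_plan(assets))
```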

Right to Erasure/Export Request

Users have the "right to know" (export request) about the personal information a business collects about them and the "right to delete" (erasure request) the personal information gathered from them, under the California Consumer Privacy Act of 2018.
The team at Lyft presented a strategically designed federated architecture that orchestrates data erasure and export requests, honoring users' rights. This thoughtful system development to comply with regulations, the "shifting privacy to the left" mindset, and the involvement of stakeholders from privacy, security, engineering, product, infrastructure, and legal exemplify how a cross-functional team can bring privacy excellence horizontally across all lines of business company-wide.
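A minimal sketch of the orchestration idea (mine, not Lyft's actual design): a coordinator fans an erasure request out to per-service handlers and records each outcome so failures can be retried. The service names and handler interface are assumptions.

```python
from typing import Callable

# Each service registers a handler that erases (or exports) one user's data
# from the stores that the service owns.
ERASURE_HANDLERS: dict[str, Callable[[str], bool]] = {
    "rides": lambda user_id: True,      # stand-ins for real service calls
    "payments": lambda user_id: True,
    "marketing": lambda user_id: True,
}

def process_erasure_request(user_id: str) -> dict[str, str]:
    """Fan the request out to every registered service and record the outcome."""
    results = {}
    for service, handler in ERASURE_HANDLERS.items():
        try:
            results[service] = "done" if handler(user_id) else "failed"
        except Exception:
            results[service] = "failed"  # failed services get retried by the orchestrator
    return results

print(process_erasure_request("user-123"))
```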

Consent

Cookie

One of the biggest technological assumptions is that users understand browser cookies, let alone can make informed cookie consent decisions and recognize the privacy harms. Moreover, cookie consent interfaces often fail to comply with regulatory requirements and are fraught with misleading labels, dark patterns, and bad practices such as dense paragraphs with limited options, burying the necessary options in links, and positioning cookie banners where users can easily overlook them.
Standardization of best practices for cookie consent interfaces is needed, including (based on user research [13]) making it easy to reject all cookies and listing cookie categories in a user-friendly manner (most users misinterpret "functional" cookies as necessary for the website to function, when they are in fact tracking cookies). Finally, research is needed to develop creative solutions (which might require a legal mandate) to automate cookie consent and lessen the burden on users.

Ethical Verbal Consent for Voice Assistants

Typically, voice assistants come with a companion app that collects user consent. Alternatively, verbal-forward consent can integrate organically into the conversation, although this strategy has drawbacks. These include a lack of information (key details may be missing), a lack of audible distinction (a subtle transition from the skill to the OS), a break in interface symmetry (the user may have to be directed to a companion app to withdraw consent), and time pressure (a bound on response timing). Furthermore, consumers are often unable to distinguish between first- and third-party skills.

Privacy Labels

The onus is on organizations to manage security and privacy risk on behalf of their application users, who come with a spectrum of technical knowledge. It is crucial to help users make informed privacy decisions with effective privacy labels in an application. Effective labels, informed by multiple disciplines, include:
  1. Reducing information asymmetry (a situation where buyers can't distinguish between high- and low-quality products, also known as the market for lemons [14]): provide users with relevant information, though this can increase cognitive load, as with lengthy (yet didactic) privacy policies that require high literacy
  2. Minimizing cognitive burden: instantly convey the privacy offered by a product and enable quick comparisons between products
  3. Addressing psychological bias: positive framing of privacy gain is more effective than negative framing of privacy loss for users' privacy decisions; users are less likely to give up privacy for monetary compensation while also being less likely to pay more to protect privacy (the endowment effect: the value attributed to a good varies depending on whether or not the person possesses it); default privacy settings are highly impactful
  4. Personalization: privacy preferences vary with expertise and individual preference; machine learning models can help, for instance, a user's comfort with data collection can be predicted using features such as benefit, information sensitivity, trust, risk, and device type
Modular labeling uses a layered method to deliver information (e.g., first an aggregate summary, then data collection and usage practices, with the latter not displayed by default): it reduces both cognitive burden and information asymmetry.
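As a rough sketch (mine, not any standard), a modular label might be represented as a first-layer summary with details attached but hidden by default; the fields and example values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DetailLayer:
    """Second layer: shown only on demand to limit cognitive load."""
    data_collected: list[str]
    purposes: list[str]
    shared_with: list[str]

@dataclass
class PrivacyLabel:
    """First layer: an at-a-glance summary suitable for quick comparison."""
    product: str
    summary: str
    details: DetailLayer = field(repr=False)  # hidden from the default view

label = PrivacyLabel(
    product="ExampleCam",
    summary="Records video locally; shares diagnostics with the vendor.",
    details=DetailLayer(
        data_collected=["video (local only)", "device diagnostics"],
        purposes=["product functionality", "reliability monitoring"],
        shared_with=["vendor"],
    ),
)
print(label)          # first layer only (details suppressed in repr)
print(label.details)  # expanded on request
```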

Data Privacy Vocabulary

Standardization in privacy is required (e.g., there is no consensus on the understanding of personal data, de-identification, and anonymization across jurisdictions) to facilitate personal data protection globally, including across jurisdictions (e.g., GDPR and CCPA), sectors (e.g., banking, healthcare), and use cases.
The Data Privacy Vocabulary (DPV) [15] is a step towards the required standardization. DPV is a base vocabulary and ontology for encapsulating the knowledge and semantics associated with personal data processing: the entities involved and their roles, technology details, relationships to laws and legal justifications, and other relevant privacy and data protection concepts. DPV can be used to frame privacy policies, access controls, use-case documentation across sectors, internet communication, and so on, bringing commonality across areas and applications.
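As a rough illustration of how DPV-style concepts might annotate a processing activity, here is a plain Python mapping; the property names are approximations of the vocabulary and should be checked against the specification [15], and the activity itself is hypothetical.

```python
# A processing-activity record annotated with DPV-style concepts (approximate names).
DPV = "https://w3id.org/dpv#"

newsletter_signup = {
    f"{DPV}hasDataController": "Example Corp",
    f"{DPV}hasDataSubject": "newsletter subscriber",
    f"{DPV}hasPersonalData": ["email address"],
    f"{DPV}hasProcessing": ["Collect", "Store"],
    f"{DPV}hasPurpose": "sending a weekly newsletter",
    f"{DPV}hasLegalBasis": "Consent",
}

for concept, value in newsletter_signup.items():
    print(concept, "->", value)
```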
The difficulties in developing DPV include enormous disparities in worldwide privacy notions, integration of complex technologies (e.g., cookies), adoption obstacles, and so on. To refine the concepts, a multidisciplinary community effort is required.
DPV is a more comprehensive version of the data privacy taxonomy provided by Fides Language [16].

Privacy Threat Modeling

Primer for Privacy Threat Modelling

Solove's privacy taxonomy [17]: a classification of privacy harms caused by information collection (surveillance and interrogation), information processing (aggregation, identification, insecurity, secondary use, and exclusion), invasions (intrusion and decisional interference), and information dissemination (breach of confidentiality, disclosure, exposure, increased accessibility, blackmail, appropriation, and distortion)
Privacy vulnerabilities: flaws that can be exploited by threats to cause privacy harms
Privacy consequences: harms that come to individuals when those vulnerabilities are exploited
Privacy threats: actions and inactions that exploit privacy vulnerabilities
Privacy attacks: actions or inactions that cause perceived privacy harm (based on Solove's taxonomy [17]) and that do not solely involve cybersecurity violations
LINDDUN [18]: High-level privacy threat model that focuses on system flaws rather than the exploitation of those flaws. The privacy threat categories include Linkability, Identifiability, Non-repudiation, Detectability, Disclosure of Information, Unawareness, and Non-compliance.
While there are robust regulations on privacy compliance, there is no mandate on privacy risk management. Privacy risk mitigation can be addressed by making privacy threat modeling informed (targeting the most likely threats).
Using a dataset of 150 privacy attacks, the MITRE team designed two prototypes for privacy threat modeling (inspired by ubiquitous cybersecurity threat modeling): a privacy threat taxonomy (hierarchical) and privacy threat clusters (clustered). Both models break attacks into threat actions (e.g., insufficient de-identification, intrude, sniffing, and fingerprint). The two prototypes differ in that the privacy threat taxonomy further groups threat actions into domains of similar activities (e.g., identification, insecurity, aggregation, sharing, notice, and consent), while privacy threat clusters group similar threat actions into overlapping threat clusters (e.g., physical monitoring, manipulation, deriving information, and extortion). Privacy threat patterns are created by taking a privacy threat cluster, mapping it to the privacy threat taxonomy, and assembling it into a linear kill-chain model (to identify possible attacks).
This approach can aid in modeling privacy risks as a function of privacy threats, privacy consequences, and privacy vulnerabilities, and help organizations in taking the necessary risk mitigation efforts.
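A small sketch (mine, not MITRE's prototypes) of how threat actions might be grouped by taxonomy domain and threat cluster and then assembled into a linear kill-chain-style pattern; the actions and groupings reuse the examples above, and the structure is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ThreatAction:
    name: str
    taxonomy_domain: str       # hierarchical grouping (privacy threat taxonomy)
    clusters: tuple[str, ...]  # overlapping groupings (privacy threat clusters)

ACTIONS = [
    ThreatAction("sniffing", "insecurity", ("physical monitoring",)),
    ThreatAction("fingerprint", "identification", ("deriving information",)),
    ThreatAction("insufficient de-identification", "identification", ("deriving information",)),
    ThreatAction("intrude", "insecurity", ("physical monitoring", "manipulation")),
]

def threat_pattern(cluster: str) -> list[tuple[str, str]]:
    """Take one threat cluster, map its actions back to taxonomy domains,
    and lay them out as a simple linear kill chain."""
    chain = [action for action in ACTIONS if cluster in action.clusters]
    return [(action.name, action.taxonomy_domain) for action in chain]

print(threat_pattern("deriving information"))
```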

Real-life Privacy Harms

Washington health data re-identification: a dataset containing demographic information was re-identified with a 43% success rate by purchasing and linking it to an auxiliary dataset, resulting in "identification" privacy harm.
Lenovo Superfish: Lenovo's preinstalled software intercepted users' internet traffic for ad targeting with intrusive pop-ups, resulting in "intrusion" privacy harm.
Cambridge Analytica: data from a personality quiz was appropriated and used to target political ads, which amounted to decisional interference by manipulating users' political decisions and producing chilling effects, resulting in "appropriation" and "decisional interference" privacy harms.

What's next?

This year's conference covered a wide range of privacy topics. PEPR '23 could focus on practical difficulties and discoveries in other privacy enhancing technologies (differential privacy was a hot topic this year), such as homomorphic encryption, federated analytics, and secure multi-party computation. In addition, we should also look forward to participation from organizations and academia around the world, not limited to Europe and the United States, because a global perspective on privacy is crucial.

References

  1. https://digifesto.com/2021/10/08/towards-a-synthesis-of-differential-privacy-and-contextual-integrity/
  2. https://gitlab.com/tumult-labs/core
  3. https://pipelinedp.io/
  4. Gaboardi, Marco, Michael Hay, and Salil Vadhan. "A Programming Framework for OpenDP." Manuscript, May (2020).
  5. https://www.cleverbot.com/
  6. https://github.com/mallory/DNS-Privacy/blob/main/DNS%20Privacy%20Vs..md
  7. Smith, Michael, et al. "SugarCoat: Programmatically Generating Privacy-Preserving, Web-Compatible Resource Replacements for Content Blocking." Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021.
  8. Huang, Danny Yuxing, et al. "IoT Inspector: Crowdsourcing labeled network traffic from smart home devices at scale." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4.2 (2020): 1-21.
  9. Thakkar, Parth Kirankumar, et al. "“It would probably turn into a social faux-pas”: Users’ and Bystanders’ Preferences of Privacy Awareness Mechanisms in Smart Homes." CHI Conference on Human Factors in Computing Systems. 2022.
  10. Gardiner, Joseph, et al. "Building a privacy testbed: Use cases and design considerations." European Symposium on Research in Computer Security. Springer, Cham, 2021.
  11. https://blog.twitter.com/engineering/en_us/topics/insights/2021/fusing-elasticsearch-with-neural-networks-to-identify-data
  12. https://glean.software/
  13. Habib, Hana, et al. "“Okay, whatever”: An Evaluation of Cookie Consent Interfaces." CHI Conference on Human Factors in Computing Systems. 2022.
  14. https://en.wikipedia.org/wiki/The_Market_for_Lemons
  15. https://w3c.github.io/dpv/dpv/
  16. https://ethyca.github.io/fideslang/
  17. https://wiki.openrightsgroup.org/wiki/A_Taxonomy_of_Privacy
  18. https://www.linddun.org/