
Originally published at towardsdatascience.com

Why do ML engineers struggle to build trustworthy ML applications?

On the gap between theory and practice and how to close it

By Alex Serban and Joost Visser

While research on the ethics of AI has increased significantly in recent years, practitioners tend to neglect the risks arising from inequity or unfairness in their ML components. This is one of the insights of the annual State of AI report for 2020, recently published by the Human-Centered Artificial Intelligence group at Stanford University.

Why does this gap exist between theory and practice, and how can it be closed?

In our own research (SE4ML), we have approached this question from several angles, specifically for the major AI sub-area of machine learning (ML).

On the one hand, we studied ML engineering practices and their level of adoption among ML engineering teams. On the other hand, we studied the requirements for trustworthy ML formulated by policy makers. Finally, we investigated the relative importance of a range of decision drivers for architects of ML systems.

As we discuss in more detail below, our overall conclusion is that ML engineering teams lack concrete courses of action for designing and building their systems specifically to satisfy trustworthiness requirements, such as security, robustness, and fairness.

We also conclude that practitioners are aware of the ethical and robustness risks posed by improper development of ML components, but are currently preoccupied with more traditional concerns, such as scaling their systems or maintaining performance between training and testing.

What are ML engineering practices?

ML engineering is a discipline concerned with developing engineering principles for the design, development, operation and maintenance of software systems with ML components.

To facilitate the adoption of these principles by practitioners, we have systematically distilled actionable engineering practices from interviews with practitioners and a broad review of practitioner blogs and academic literature. The resulting practice descriptions are catalogued on our website.

Moreover, we measure and publish the adoption rate of these practices and their perceived effects, in order to allow assessment and comparison of teams of engineers building software with ML components.

If you are a member of a team that builds software systems with ML components, you can contribute to our research by taking our 10-minute survey on ML engineering practices.

Trustworthy ML

Following recent interest from policy makers in addressing improper use of machine learning (ML), we have worked to extend our catalogue of practices with operational practices for trustworthy ML.

In particular, we followed the requirements for trustworthy ML developed by the European Commission's High-Level Expert Group on AI, which defines trustworthy AI as lawful, ethical, and robust, and sets out seven key requirements for achieving it.

Inspired by these requirements, we searched the literature for engineering practices that developers can apply directly to ensure the ethical and robust development of ML components. In total, we identified 14 new practices, covering topics such as testing for bias, assuring security, and having the application audited by third parties.

For each new practice, we summarized the related work into a body of knowledge that follows the same structure as the existing practices in our catalogue: a detailed description, a concise statement of intent, motivation, applicability, related practices, and references.

Adoption of practices for trustworthy ML remains low. These 14 practices for trustworthy ML are described in detail in our ML engineering practices catalogue.

We also extended our survey to measure the adoption of trustworthiness practices by teams developing ML solutions. Unfortunately, we found that the practices for trustworthy ML have relatively low adoption (as can be seen in the figure above).

In particular, practices related to assuring the security of ML components have the lowest adoption. The factors contributing to these results are diverse. For example, most defenses against adversarial examples, a known threat to ML components, have been breached. Moreover, research into data poisoning attacks shows that only a small percentage of the training data needs to be altered in order to induce malicious behavior. We believe the lack of “off-the-shelf” solutions to the security issues of ML components is the largest contributing factor to our results.
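To give a sense of why off-the-shelf defenses are hard to come by, here is a minimal, self-contained sketch (an illustration of adversarial perturbations in general, not a practice from our catalogue; the model and numbers are made up) showing how easily a prediction can be flipped. For a linear classifier, the gradient of the score with respect to the input is simply the weight vector, so a tiny step against it changes the outcome:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "trained" linear classifier: score(x) = w @ x + b
w = rng.normal(size=50)
b = 0.0

# An input that the classifier labels as positive
x = 0.05 * np.sign(w) + 0.01 * rng.normal(size=50)

def predict(v):
    return int(w @ v + b > 0)

# FGSM-style perturbation: step each feature against the gradient of the score,
# which for a linear model is simply the corresponding weight.
eps = 0.1
x_adv = x - eps * np.sign(w)

print("clean prediction:      ", predict(x))               # 1 (positive)
print("adversarial prediction:", predict(x_adv))           # 0 (flipped)
print("max per-feature change:", np.abs(x_adv - x).max())  # 0.1

Real attacks target deep neural networks rather than linear models, but the underlying mechanism, a small targeted perturbation guided by the model's gradients, is the same, which is part of why robust defenses remain an open problem.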

Architecture of ML systems

In a parallel study, we investigated how practitioners define the software architecture of systems with ML components, and asked them to rank the most important decision drivers for their systems.

Once again, we found that decision drivers related to trustworthy ML, such as Security, Privacy, or Robustness, were considered less important than more traditional drivers such as Performance or Scalability (as can be seen in the figure below).

Architectural decision drivers for systems using ML components. Drivers related to trustworthiness tend to be considered less important.

These results suggest that practitioners are still working to resolve basic concerns in developing and operationalizing ML components, and tend to neglect the importance of trustworthy ML requirements.

Outlook

We believe that consistent efforts to address ethics and robustness through the lens of software engineering will enhance the ability of practitioners to prioritize these requirements and develop trustworthy ML components.

Moreover, we believe that the adoption of trustworthiness-specific and general ML engineering practices is interconnected. For instance, the practice of continuous integration can make the practices for bias testing more effective.
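As a hypothetical illustration of this interplay (the metric, threshold, and data below are assumptions for the example, not part of our catalogue), a bias check can be written as an ordinary unit test that a continuous integration pipeline runs automatically on every model or data update:

import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute difference in positive-prediction rates between two groups."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def test_predictions_respect_parity_threshold():
    # In a real pipeline, these would be the freshly trained model's predictions
    # on a held-out evaluation set annotated with a protected attribute.
    y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
    group = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
    assert demographic_parity_difference(y_pred, group) <= 0.25

Run under a test framework such as pytest, a regression in fairness then fails the build just like any other broken test.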

Therefore, we are working on defining sets of traditional engineering practices and trustworthiness-related practices that practitioners can apply directly to develop more ethical and robust ML components.

Learn more about the practices for trustworthy ML by reading our publication.

Alex Serban is a PhD candidate at Radboud University and a guest researcher at the Leiden Institute of Advanced Computer Science.

Joost Visser is professor of Software and Data Science at Leiden University.

In the SE4ML project, we investigate how software engineering principles and practices need to be adapted or complemented for software systems that incorporate ML components.

If you are a member of a team that builds software systems with ML components, you can contribute to our research by taking our 10-minute survey on ML engineering practices.

