What are the ingredients of a successful Open Source cheminformatics software?

Posted on November 30th, 2020 in Chemistry

(Written by Elena Herzog in
collaboration with Markus Fischer, Gerd Blanke, Jarek Tomczac and Gabrielle
Whittick)

RDKit, a collection of cheminformatics and machine learning software, is helping to solve chemical information challenges. The founder and creator of RDKit, Greg Landrum, was interviewed by the UDM (Unified Data Model) team, facilitated by Elsevier, to share his experience of what the road to success looks like and what ingredients an open source project needs to be successful. The lessons from the interview will help to shape the future of the UDM project, which is moving from its consortium-led Pistoia Alliance model to a community-led model.

How did it all begin?

Greg is a chemist. After his postdoc in Germany,
he moved to California and joined a couple of start-ups. Eventually he started a
small computational chemistry start-up providing consulting and machine
learning services. Open source in chemistry was limited back in 2000, and the absence
of good alternatives sparked the creation of the RDKit. The open source oelib
(which eventually became OpenBabel) did not have a licence they could use, and
attempts to license the commercial Daylight toolkit were unsuccessful. So they
started writing code themselves, adding new pieces little by little. The
company was eventually shut down in 2006 and, rather than seeking a
purchaser for the technology, they decided to open source the code. Greg joined
the CADD group at Novartis in Basel and was able to set up a process allowing
him to continue to work on the open-source RDKit while at a large pharma company. In
2011, the development ramped up even more when he moved to the Research. Required
extensions were either built with internal funding or written by external
programmers paid by Novartis to work on RDKit. “Working with the other scientists at Novartis really
helped inform the direction we took with the RDKit,” said Greg. In 2016, Greg
left Novartis for KNIME, the company behind the OS data analysis platform, and,
at the same time, started a small consultancy company, T5 Informatics, which
provides custom development services around RDKit. It is the combination of RDKit
as OS software and T5 Informatics that allows Greg to do what he enjoys most and
to spend his time developing and extending functionality together with a
group of people with similar interests.

What does the RDKit community look like?

“The heart of any successful open source
project is its community,” says Greg. Insights into that community are not easy to obtain, though;
that is simply how the OS project is run. Nobody asks anybody who they are
or where they come from. Some idea of the community comes from the RDKit UGMs
(User Group Meetings). The most recent UGM, held virtually in October
2020 because of Covid-19, registered more than 500 participants, the highest number recorded in
the nine years of RDKit UGMs. Registrants who replied to a Google survey
came from industry (52%), academia (40%), and government, laboratories and
non-profit research organisations (8%). Of the industry respondents, 70% were from pharma and
biotech and 20% from software. This is hardly surprising given the features provided by
RDKit. The UGMs are heavily focused on Europe, but there are large numbers of
users in the US, Japan and China. A Japanese UGM was planned for this year,
but it was cancelled because of the Covid-19 situation.

How do people contribute to RDKit and why do
they contribute?

Greg defines contribution in its broadest sense,
for instance:

  • Code, for sure
  • High-quality bug reports, which are
    considered very valuable
  • Good documentation, which is very valuable and
    incredibly helpful
  • Participation in answering questions,
    commenting and discussing issues

The rdkit-discuss mailing list is the primary
communication channel for the RDKit community; people also use it as a Q&A
platform. It is hard to determine why people decide to answer emails. If a question is
about a specific feature, developers often answer it, but again, there is
no real mechanism to make people contribute unless they want to. From
time to time, some “wrong” answers show up, but proficiency and comfort come
with experience. The majority of users have a problem to solve and want to
understand it and find people who might be working on a similar problem. Some people may
feel an obligation: ‘I am using it, why should I not contribute?’ For some,
it is about recognition; active people are recognized in the community. It also
seems that if there is code attached to a publication, researchers are more
inclined to use it. This increases citations, which matters to both
the publication and the author. Greg believes that there are data supporting
this, but he was not 100% sure. Another “selfish” motive for
contributing to OS projects is to be able to carry on working on the code in the future,
even after leaving or changing employers. Whatever the reasons might be, the
important thing is that the RDKit community is friendly and open; people feel
good about the project, and all of this surely helps with adoption.

How do companies contribute to RDKit?

Many companies have contributed to the
development and extension of RDKit by either funding developers internally or hiring
external developers. Companies that participate have an easy way to attract
people with RDKit expertise. For instance, many students work on OS software,
so employers can see exactly what developers do and how they do it. Examples
of companies using the RDKit internally and contributing to it include
Schrödinger, Cresset, Novartis, Roche, Medchemica, Relay Therapeutics and NextMove
Software. Many other companies are using RDKit. For example, Elsevier provides
and supports it on Entellect’s Reaction Workbench, PerkinElmer uses it in
Spotfire, and chemistry extensions based on the RDKit are available in Mathematica.
Google runs its “Summer of Code” programme, which includes projects that improve and contribute to
RDKit tools. These important use cases increase adoption and
acceptance of RDKit.

What are the benefits for companies of contributing
code to RDKit?

This is a very important point and, in fact,
there are many good reasons why companies choose to contribute code to RDKit:

  • Testing and validation of code
    become easier, as the pool of testers is theoretically unlimited
  • If a company decides that a piece of
    code is not IP critical, the code can be supported by the community, and somebody from
    the community might fix bugs
  • Developers and cheminformaticians with
    RDKit expertise are known to the companies that follow and contribute to the
    development. These developers can be quickly mobilized to work on features that companies
    are interested in
  • The UGMs circulate lists of open
    positions advertised by companies, and this year there was a Discord channel
    for announcing open positions. Companies can also post openings on the mailing list or
    LinkedIn group. In addition, a conversation has started on how to fund developers
    on a contract basis, although there is no organization that can
    accept funding for RDKit

What governance structure does RDKit have and who
decides on what?

The Python community refers to Guido van Rossum,
the creator of the language, as “Benevolent Dictator for Life” (or BDFL). The
RDKit currently follows more or less this model. There is not much of a
governance structure; however, there are four core maintainers, and all
contributions are reviewed by at least two of them. Theoretically, two
developers must sign off and one of them should be Greg. He mentions that this
may not be the best way in the long term, but it is how it is. There are not many
decisions that they need to make; most of the decisions are tactical, and each
maintainer decides what they want to work on. There is a broad list of topics they want
to work on, some driven by long-term goals and some by companies’ requests. The
other three maintainers are from Schrödinger, Novartis and Relay.

Under what licence does RDKit operate?

“OS licences are extremely important and
contentious,” Greg points out. RDKit uses the BSD licence. The BSD licence is very
permissive and allows commercial use; this is intentional. The code is
covered by copyright. By default, copyrighted material cannot be re-used;
however, the licence allows use and re-distribution of the code. At the top of each
RDKit source file, there is a copyright statement listing the authors who have
contributed the code. At the bottom of each file, it states that all
rights are reserved and that the file is covered by the licence.
One can consult the licence to
check what is allowed and what is not. For example, you cannot take the
code, remove the copyright notices and re-publish it. The licence also
includes a clause disclaiming liability. Greg recommends using standard licences
for OSS, as many big companies are familiar with them and, hence, more willing
to use the OS software. To be clear, companies can build on the RDKit code and
commercialize it. Schrödinger and Cresset use RDKit in their computational chemistry
code. RDKit is intended to be used in computational software; the companies do
not need to communicate anything to Greg or the RDKit community. Moreover,
there are filed patents that use RDKit. For example, as of October 2020, a Google
patent search returned 168 results in which RDKit is used.

Are there any IP rights or copyrights when
people contribute to RDKit?

This can be tricky in some cases.
Some OSS projects want to cover everything under one copyright and require
copyright to be assigned before they accept code. The RDKit does not do this. As RDKit is not
an organization, it cannot ask people to assign the copyright to it. Contributors
(and their employers) determine the copyright on the pieces of code they contribute. However,
all contributions must be covered by the same BSD licence as the rest of the
RDKit.

Does RDKit accept funding for specific projects?

Because RDKit is not a legal
organization, it cannot accept funding. There are consultants you can pay, but
there is no central place to pay for development work. Contributing companies
fund their own programmers or external programmers to work on
RDKit development and extensions. For example, Novartis has done both: it paid T5
Informatics and had internal developers contribute to RDKit. T5 Informatics,
in turn, being a consulting company, could process funding for RDKit if needed.
To the extent that Greg could focus on RDKit development, the RDKit has
benefited from it. When asked about crowdsourcing, Greg mentioned a success
story in which Andrew Dalke managed to raise funding for the development of MMPDB.
However, it is questionable how successful future projects can be at
raising money from interested individuals. The cheminformatics space is small,
and the number of companies that would be interested in sponsoring RDKit
development outside of their commercial interests is limited. How to fund a number of
interesting projects that are not urgent or visible enough is still occupying
the creator’s mind.

How does Greg see the future of RDKit?

Greg feels it is a long game, and the hope is
that the toolkit continues to evolve. Adoption and wider usage in research
IT organisations such as Elsevier and in pharma are extremely important and would have
positive effects. In addition, more systematic integration of the software into the internal
workflows of commercial companies would increase adoption
and expand the community.

Is there any value for RDKit to work closely together
with UDM?

The UDM is primarily an exchange standard, not
software; it is more of an open documentation project and less OSS,
unless there is an idea to build software that does something around UDM.
Open documentation projects might use different licences (for example, Creative
Commons licences). It is difficult to say what
the right model for UDM is, but having it as an OSS project under the umbrella
of a standards organization such as IUPAC is a good idea. If UDM is successful, a reader and
writer could be handy; having the code and being able to do something with UDM
files is valuable and useful and might speed up adoption.

Closing considerations

The RDKit is used to process, harmonize, enhance
and analyse chemical data. Demand has increased both for software that can help make
data ready for, say, AI/ML, and for chemists who have the skills and
knowledge to carry out these tasks. Elsevier, with its high-quality
chemical and biological data, often processes these data for various modelling projects,
such as AI/ML-based synthesis and pharmacological modelling predictions. As
such, it is well positioned to support OS projects and chemical standards, not least
because its customers are increasingly using and embedding these tools and
standards in their ecosystems. The interview with Greg Landrum
confirms Elsevier’s interest in working together and helping
researchers and healthcare professionals advance science and improve health
outcomes for the benefit of society. We are thankful to Greg Landrum for sharing
information with Pistoia’s UDM team and Elsevier on how the RDKit works and what
contributed to its success. The shared information is already informing the
next steps of the UDM Project’s transition. Finally, the knowledge gathered from
this interview might help commercial companies and research organisations to build
and maintain future relationships with various types of Open Source and Open
Documentation Projects.
