Steps to an open tool for pKa-prediction

Update, 16th Sep, 2014: Started a writeLatex document on this topic.

‘Is there an open source pKa or LogD tool available?’ This question was asked four years ago. Still, there is no really good tool available. This post collects my thoughts on creating one.

So, what do we need for an open tool for pKa-prediction? First, an open database is required to which raw titration data sets can be submitted. The pKa-values of these data sets have to be determined (including statistics) and then linked to the structure and moieties of the corresponding molecules. Finally, methods and models for pKa-prediction based on and/or trained with this data have to be developed.

Acquiring titration data

To get a lot of raw data, Chris Swain suggested letting undergraduate students determine pKa-values and put the results in an open database. Although this is a good idea, it requires some preparatory work.

pKa-values are thermodynamic quantities, which means that they depend on many environmental parameters such as temperature, pressure, ionic strength, etc. The concentration (and the grade) of the substances and solvents used (water, DMSO) also plays a role, as do the instruments (accuracy of the pH-meter and thermometer). In theory, even the stirring speed and the waiting time between adding a drop and reading the pH-meter matter. What about special cases, such as when the solubility or handling depends strongly on the pH (e.g. precipitation at low pH or handling oily substances) or when the pKa is above 12? Hence, standard protocols (plural!) are needed for titrating substances, and all recorded data has to be submitted.
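To make ‘all recorded data’ a bit more concrete, here is a minimal sketch of what one submitted titration run could look like as a record. All field names, units, and the plausibility thresholds are assumptions made for illustration; a real protocol would have to define them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TitrationRecord:
    """One submitted titration run; every field name here is hypothetical."""
    compound: str                # name or structure key of the analyte
    titrant: str                 # e.g. "NaOH"
    titrant_conc_mol_l: float    # titrant concentration in mol/L
    analyte_conc_mol_l: float    # analyte concentration in mol/L
    solvent: str                 # e.g. "water" or "DMSO"
    temperature_c: float         # temperature in degrees Celsius
    ionic_strength_mol_l: float  # ionic strength in mol/L
    ph_meter_accuracy: float     # stated accuracy of the pH-meter in pH units
    wait_time_s: float           # waiting time between addition and readout
    points: List[Tuple[float, float]] = field(default_factory=list)  # (volume in mL, pH)

    def problems(self) -> List[str]:
        """Return reasons why the record looks incomplete or implausible.

        The thresholds (10 points, pH 0-14) are arbitrary example values.
        """
        issues = []
        if len(self.points) < 10:
            issues.append("fewer than 10 data points")
        volumes = [v for v, _ in self.points]
        if volumes != sorted(volumes):
            issues.append("volumes are not in ascending order")
        if any(not 0.0 <= ph <= 14.0 for _, ph in self.points):
            issues.append("pH outside 0-14")
        return issues
```

A submission form could call `problems()` and show the list to the student before accepting the data, without rejecting anything outright.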

Someone could object that using a group of inexperienced students will result in bad data sets. How would you ensure that the data submitted to the database is ‘good’ (whatever that means) and accurate? I think statistics will compensate for the inaccuracy of individuals in this case. If students submit 1000 data sets for one compound, it is nearly impossible that they all made the same random errors. Of course, there could be a systematic error, which averaging does not remove. A good standard protocol should rule this out, though. On top of that, a supervisor/instructor who wants to support a project like this by contributing data will keep an eye on the students and their performance (I would!).
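The statistical argument can be sketched with a small simulation. The numbers below (a ‘true’ pKa of 4.76, a random scatter of 0.2 pH units, an offset of 0.3) are invented for illustration and are not measured data.

```python
import random
import statistics


def simulate_class(true_pka, n_students, random_sd, systematic_offset=0.0, seed=0):
    """Return simulated pKa results; all inputs are made-up example values."""
    rng = random.Random(seed)
    return [true_pka + systematic_offset + rng.gauss(0.0, random_sd)
            for _ in range(n_students)]


def summarize(values):
    """Return the mean and the standard error of the mean (stdev / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / len(values) ** 0.5
    return mean, sem


# Random errors only: the mean ends up close to the true value.
good = simulate_class(4.76, 1000, random_sd=0.2)
# Same scatter plus a shared offset (e.g. a badly calibrated pH-meter).
biased = simulate_class(4.76, 1000, random_sd=0.2, systematic_offset=0.3)

print(summarize(good))
print(summarize(biased))
```

With 1000 results the standard error becomes small in both cases, but the second mean stays shifted by the offset. More data shrinks random scatter only, which is why the protocol has to take care of systematic errors.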

Update, 6th Sep, 2014: These protocols should be developed by the students themselves. As Anne said in the comments: ‘Let´s give people the time and experience they need to become data curators, like JCB did with their students. Said that, of course, it should have some quality indicators. Why not discuss this and elaborate with the students?’

Finally (and this is the reason why I really like this idea from Chris), it will motivate students to perform their experiments with great care if they know the results have value. Think back to your lab courses in the 1st, 2nd, or 3rd term/semester/year. You did your experiments, produced some data, the instructor/supervisor reviewed it, and literally … threw it away (or you did). Reproducing experiments had no value except for practice and passing the lab course. Wouldn’t it be more satisfying for a student if he knew his data was used in ‘real’ research and that he contributed to it? I think it would!

Setting up a database

Technically, it does not really matter what sort of database one uses for saving the data from above. Only the interfaces for submitting and accessing the data are important. First, people should not have to pass extra hurdles (e.g. logins) when they want to contribute data. It should be easy, straightforward, and rewarding!

Secondly, the data in the database can serve as a starting point for many projects. It can be used not only for pKa-prediction but also for safety purposes (it could be used in Beryllium10, for example), in metabolism/toxicological research projects, as a basis for semi-empirical quantum chemistry methods or, of course, ‘just’ as a reference. Thus, having such a database (with many, extensive data sets) is already a great treasure!

Update, 6th Sep, 2014: A wiki as used by the Open Notebook Science Challenge seems like a good option. However, the problem would be to establish the ‘no peeking’-rule as suggested by Peter (see comments). The open data principle automatically eliminates this, IMHO. Everyone has access to the data, so everyone can peek. Maybe this should be discussed with the students when introducing them to this?

pKa determination and linking to molecule structure

It is simple to automatically extract pKa-values from a titration plot as long as the type and number of acid/base groups are known. Think of the very linear titration plot of citric acid, for example. If you know it is the result of three deprotonations, you can easily read out the data and link it to the corresponding moieties.
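For the simplest case, a single acid group titrated with a strong base, the extraction can be sketched as follows: find the equivalence point as the steepest part of the curve and read the pH at half of that volume, where pH is approximately equal to pKa (Henderson-Hasselbalch). The function names and the idealized test curve are assumptions for illustration. Overlapping steps such as those of citric acid would need a fit of the whole curve instead.

```python
import math


def estimate_pka_monoprotic(volumes, phs):
    """Estimate the pKa as the pH at half the equivalence volume."""
    if len(volumes) != len(phs) or len(volumes) < 3:
        raise ValueError("need at least three (volume, pH) pairs")
    # Equivalence point: midpoint of the interval with the steepest pH rise.
    best_slope, v_eq = 0.0, None
    for i in range(len(volumes) - 1):
        dv = volumes[i + 1] - volumes[i]
        if dv <= 0:
            raise ValueError("volumes must be strictly increasing")
        slope = (phs[i + 1] - phs[i]) / dv
        if slope > best_slope:
            best_slope, v_eq = slope, (volumes[i] + volumes[i + 1]) / 2
    if v_eq is None:
        raise ValueError("no rising segment found")
    v_half = v_eq / 2
    # Linear interpolation of the pH at the half-equivalence volume.
    for i in range(len(volumes) - 1):
        if volumes[i] <= v_half <= volumes[i + 1]:
            frac = (v_half - volumes[i]) / (volumes[i + 1] - volumes[i])
            return phs[i] + frac * (phs[i + 1] - phs[i])
    raise ValueError("half-equivalence volume lies outside the data")


def synthetic_curve(pka=4.76, v_eq=10.0, v0=10.0, c_base=0.1):
    """Idealized curve for testing only; ignores dilution and water autoprotolysis."""
    volumes, phs = [], []
    for k in range(1, 24):
        v = 0.5 * k
        if v == v_eq:
            continue  # skip the equivalence point itself
        if v < v_eq:
            ph = pka + math.log10(v / (v_eq - v))  # buffer region
        else:
            ph = 14.0 + math.log10(c_base * (v - v_eq) / (v0 + v))  # excess base
        volumes.append(v)
        phs.append(ph)
    return volumes, phs


volumes, phs = synthetic_curve()
print(round(estimate_pka_monoprotic(volumes, phs), 2))
```

Real student data will be noisier than this curve, so a smoothed derivative or a fit would probably be more robust than picking the single steepest interval.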

Actually, this could be done when submitting titration data. The user is presented with the results of the automatic determination, draws the structure of the compound (if not already present), and connects the values to the corresponding protons. On top of that, the user can choose to manually set, add, or delete pKa-values if the algorithm fails partly or completely, and so train the algorithm. There will be more than one data set for each compound, which gives you the ability to use statistical tools.
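Once several submissions exist per compound, they could be combined per ionizable site. The following sketch groups values by compound and site and reports a median, which is less sensitive to a single wrong assignment than a mean. The tuple layout and the site labels are hypothetical.

```python
from collections import defaultdict
import statistics


def aggregate(submissions):
    """Group (compound, site, pKa) tuples and summarize each group."""
    groups = defaultdict(list)
    for compound, site, pka in submissions:
        groups[(compound, site)].append(pka)
    summary = {}
    for key, values in groups.items():
        summary[key] = {
            "n": len(values),
            "median": statistics.median(values),
            # The sample standard deviation needs at least two values.
            "stdev": statistics.stdev(values) if len(values) > 1 else None,
        }
    return summary
```

The example values in any test of this function are placeholders; which summary statistics the database should actually expose is an open design question.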

pKa prediction

This is the interesting part. It is straightforward, though. Now that an extensive base set exists (we assume that the database from above is already filled), you can develop models based on it and methods that can be trained with it! Since the data and the database are open and free, anyone can use them and try to develop something. Of course, although that might sound easy, it is still challenging.

There are all sorts of problems with prediction. For instance, symmetry! Think of citric acid again, which has three pKa-values instead of just two, even though two of its three carboxyl groups are equivalent by symmetry. Also, IMHO a good pKa-prediction tool should be able to predict the dissociation constant of every proton in a molecule, such as that of an aldehyde or just an ordinary C-H bond. This could be tricky, though.

My idea for a GUI version of a prediction tool would be something very simple. Let the user draw a structure (or load a file or a structure key), and the tool predicts the pKa-value of every proton present. Nothing more is needed (apart from saving the data, of course).

That this is not easy becomes clearer if you look at some implementations. ToxPredict calculates a pKa of 5.21 (instead of about 10) for phenol. Also, it gives you just one pKa for citric acid. Additionally, it does not link these values to any moieties in the molecule.

If you search the literature, you will find all sorts of models for predicting values for proteins, alcohols, and carboxylic acids. You will also find papers about software such as Epik, which is part of the proprietary Schrödinger suite. Nothing open, though (except AMBIT, which is used by ToxPredict).


The key to developing new and better models and methods, and in turn a prediction tool, is an extensive base data set. Ideally, this is a free, open, and well-documented data set that everyone can use. So, let us create one!

My two cents.

Update: Remarks (26th Aug, 2014)

6 thoughts on “Steps to an open tool for pKa-prediction”

  1. Antony Williams

    It might be easier to simply provide an interface for students to enter data harvested from the literature? i.e. draw a structure and enter the pKa’s associated with each ionizable site. Also enter the literature reference/doi and allow for additional annotation. Surely faster than asking students to measure the data and all the possible errors that might ensue…not that there aren’t errors in the literature for sure!

    1. Sven Kochmann (Post author)

      Here at the University of Regensburg, students have to pass lab courses in analytical chemistry (it is a major here, so everyone has to do it). In the very first one, they have to actually titrate acids/bases (e.g. to determine them by molar mass). After passing the experiment/lab course, however, the data is thrown away. With a little modification to the protocol the data could be used for such a database. I’m pretty sure that we are not the only university/faculty with such lab courses in analytical chemistry. 🙂

      I see some issues with the literature harvesting you suggested. First, literature data often comes out of nowhere, especially if it’s some minor data. Except for very old publications, papers do not provide the raw titration data or a detailed protocol (mostly stating something like ‘The pKa of X was determined by titration with Y. It is 6.1’). On top of that, if they do, we have to ask for permission to use it in a database (which is republishing). The question is: do we really want to base an open product on closed (and proprietary) data we could not even fully rely on?

      Secondly, you will not find like 1000 data sets for one single compound in the literature. Mostly, one does it and it is then cited… and cited… and cited.

      Finally, I think it has more value to let the students do an experiment and evaluate it than to sit them in front of a computer and use them as literature harvesters.

      Edit: I think it should be in accordance with Peter Murray-Rust’s concept of ‘born open’. I like that!

  2. Peter Murray-Rust

    I think this is a great idea. Unfortunately I don’t have any students!

    It’s exactly what Jean-Claude would have done.

    And let’s give the students the opportunity to discover that the protocols may need enhancing. That’s what happened in Galaxy Zoo – when something unusual happened the community discussed whether it was a real effect (it was, and they got a paper out of it).

    This will give a wonderful opportunity for the world to see what variation occurs in scientific experiments. Do multiple observations reduce random error? We expect so but it’s important to confirm it.

    The important thing is that the Open Notebook should not be influenced by other measurements on the same compound. If the “expected” value of property P is 12.4 and you get 1.24 you mustn’t assume you have missed a factor of 10. Maybe the others did. Maybe you are “right”. And this will give a great indication of what transcription and other errors occur.

    So in addition to “no insider knowledge” we should perhaps add “no peeking” prior to the measurement. You devise a protocol and publish it; you carry out and record the experiment and record it publicly in real-time. Only later might you look to see how well your result “agrees”. Because otherwise there is a subconscious motivation to converge on a “correct” value.

    I believe that Clathrates were discovered in an Oxford undergraduate laboratory (or perhaps merely that they could have been). Reduction of quinone by SO2 created a precipitate which no-one thought to analyze – it was quinol-SO2 clathrate. ONS might have discovered this if we ask every undergraduate experimenter to keep their eyes and brain open.


  3. Anne Clinio

    Peter shared this post with me because he knows of my interest in open notebook science and JCB’s ideas. Since I don´t know much about chemistry, I will give my two cents based on what I have already read about JCB’s initiatives.

    First, I think we should study “The Open Notebook Science Challenge” and learn from what was done previously. Maybe Cameron Neylon and other authors related to this initiative could help. I can research its methods and strategies and make some suggestions. ;]

    JCB seems to have a great passion for teaching and great respect for his students, and I think that feeling is shared here. So, I would not give so much credit now to criticism about the accuracy of the students’ work. There is a bit of prejudice behind this statement, isn’t there? Let´s give people the time and experience they need to become data curators, like JCB did with their students. Said that, of course, it should have some quality indicators. Why not discuss this and elaborate with the students? The Open Notebook Science Challenge applied quality criteria instead of quantity of measurements to acknowledge students’ work. I think this is a great way of dealing with it. The description of pKa prediction seems so complex to me that it requires a huge documentation effort. It should be more real-time than JCB suggested! Less than 12 hours delay for sure!

    It is very interesting how JCB worked with the established literature, transforming his students into data curators. He created a crowdsourcing process that opened, somehow, relevant info that was behind paywalls through his Google spreadsheets and experiences. This is an advantage of working with the literature: being able to criticize it. But, as Peter said, “no peeking” should be important as well, so there isn´t a subconscious motivation to converge on a “correct” value. For example, I like the way Peter exemplifies manual content mining with printed texts, color pens and some categories. Can we create a way of experimenting in chemistry that shows the limitations of just blindly following the recipe? How can we stimulate creativity?

    About how to motivate students: it could start with a broad explanation of open research, so they know they are involved in something bigger. Presenting the Open Notebook Science Challenge initiative and its collective and individual achievements should motivate them as well. Last, but not least, students want tangible recognition of their efforts. Some ways of doing this are giving extra points on grades, making this task part of their evaluation, or a monthly acknowledgement, like the challenge did.

    Please, let me know your thoughts on this.


    1. Sven Kochmann (Post author)

      Thank you both, Anne and Peter, for your comments!

      You’re right. Students should be involved in the complete process, from designing the experiment(s) to performing them to discussing the results. I really like that and I am thinking about taking some students out of class and just trying the following (‘unfortunately’ I am moving to Canada in a couple of weeks so that has to wait a little):

      Sitting together with 5-10 (motivated) students, telling them what the problem is and what we want to achieve, and then letting THEM discuss HOW to achieve this. What should the experiment(s) look like? What do we need? W…? … Subsequently, let them do the experiment(s) and later let THEM discuss the results/notes/… and submit them. It should not take more than one or two afternoons (maybe three). I would give them extra points on grades if they did that!

      The whole time I have been thinking about an appropriate platform for collecting data to start with. A wiki like ONSchallenge seems good. However, the problem would be to establish the ‘no peeking’-rule. I have no idea how to do that, since the open data principle automatically eliminates this, IMHO. Everyone has access to the data, so everyone can peek. Maybe this should be discussed with the students when introducing them to this? How would we do it?

