Quantitative Structure-Activity Relation Study of Quaternary Ammonium Compounds in Pathogen Control: Computational Methods for the Discovery of Food Antimicrobials

Received: May 27, 2016; Accepted: June 20, 2016; Published: June 25, 2016

Bacterial contamination of the surface of fresh meats and produce after processing is currently one of the largest problems in the food processing industry. The bacteria that cause most foodborne illness include, but are not limited to, Shiga toxin-producing Escherichia coli (E. coli) and Salmonella typhimurium (S. typhimurium) [1,2]. Not only do these bacteria cause disease, they also cause spoilage. It is estimated that in 2010 the United States of America discarded 133 billion pounds of food, mostly due to spoilage [3]. These bacteria cannot be removed by the simple water spraying implemented by most processing facilities [4]. As such, many technologies have been developed to combat bacteria on the surface of food products.
This article is available in: www.cheminformatics.imedpub.com/archive.php
Unfortunately, the technologies currently used to remove bacteria from these surfaces suffer from a variety of issues: high cost, hazardous byproducts, environmental hazards, and discoloration of products [9,10]. This study focuses on the use of cetylpyridinium chloride (CPC) for decontamination. CPC is an effective antimicrobial. It has been approved only for use on raw chicken, although it has also shown effectiveness on beef and produce for both disinfection and the extension of shelf life [11][12][13]. CPC is classified as a quaternary ammonium compound (QAC), a class defined by its cationic nitrogen head. Generally, QACs work as antimicrobials by disrupting cell walls and membranes with their hydrophobic tails. These tails pinch off sections into small vesicle-like structures and cause cell leakage that eventually leads to cell death [14][15][16]. CPC follows this same mechanism, and there is also evidence of other, more specific targets, including transferrin denaturation, ion channel blocking, and knock-down of halitosis-specific transcription factors [14,17,18].
CPC does, however, have its own flaws. It is an environmental hazard and leaves a toxic residue on surfaces [13,19]. This residue is dissolved and subsequently removed using propylene glycol as a cosolvent with water. Unfortunately, this adds to the cost and complicates the safe disposal of CPC [20]. Environmentally, the disposal of CPC is a major concern. CPC is naturally broken down by bacteria, but at higher concentrations it kills the bacteria before they can degrade it. In aquatic environments, residual CPC causes a decrease in microflora and in algae blooms. This decrease causes a trophic cascade, negatively impacting all organisms in the local community [21]. The remnants of CPC in the environment can also propagate antimicrobial resistance in the local microbial communities, which can have a lasting impact [22]. In humans, QACs taken orally in high doses (100-400 mg/kg) have shown detrimental effects including mucosal necrosis, hemorrhaging, formation of ulcers, and severe liver, kidney, and heart changes [23,24]. CPC in particular has been shown to cause liver and kidney vacuolization as well as paralysis when given orally to rats and rabbits [25].
Discovery of novel drugs is typically limited by the funds available and by how precisely the drug targets are known. Because of the nonspecific nature of CPC and the imprecision of library screening methods, our lab turned to quantitative structure-activity relationships (QSAR). QSAR recodes molecular structures into quantifiable forms, which are then correlated with a specific biological activity. The resulting model can then be used to predict the biological activity of untested structures [26]. The bioactivity that we wish to study is the minimum inhibitory concentration (MIC), which is a measure of the effectiveness of an antimicrobial. A lower MIC denotes a more effective compound. Using this method we hope to discover potential structures that could function as well as CPC, with reduced or nonexistent negative effects on the human body and the environment.

Data collection
Three sets of data were collected via literature searches: (1) a model building set, (2) a validation set, and (3) a prediction set of compounds [27][28][29]. The model building set was based on known QACs with data on the MIC of these compounds against E. coli. The validation set contained known QACs that were not used for the model building set. Compounds for the prediction set were collected from a substructure search on PubChem using CPC as a reference. The top 1000 compounds sorted by relevance were selected for further testing.

Descriptor calculation
All descriptors for the model building set (Supplemental Data 1), the validation set (Supplemental Data 2), and the prediction set (Supplemental Data 3) were calculated simultaneously using the ochem.eu chemical database [30]. Using the tools on this site, the structures were cleaned by removing the salts associated with each compound. Under the Models tab, the Calculate Descriptors program was selected and the SMILES string for each compound was uploaded in an Excel file (.xls). These SMILES were used to calculate descriptors through this database. The following descriptors were selected: E-state (all but extended indices), ALogPS, GSFragments, ISIDA fragments (lengths 2-15, in order to cover long carbon chains), and QNPR. These descriptor sets were chosen because a large number of compounds encountered errors during 3D structure calculations. Unless noted, all descriptors were left at the default settings. This totaled 1356 descriptors for each compound. The descriptors and the chemical IDs were then downloaded as a .csv file, ignoring any compounds that encountered an error. The model building set and the validation set had no errors, and 163 compounds were removed from the prediction set due to errors in calculation.
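
The salt-removal step can be illustrated with a small sketch. This is not the OCHEM cleaning routine, whose internals are not described here. It is a simple heuristic that assumes the counterion is the smallest dot-separated fragment of the SMILES string, and the function name `strip_counterions` is hypothetical.

```python
def strip_counterions(smiles: str) -> str:
    """Return the largest dot-separated fragment of a SMILES string.

    Heuristic only: fragment size is approximated by counting letters,
    so brackets, charges, and ring-closure digits are ignored.
    """
    fragments = [frag for frag in smiles.split(".") if frag]
    if not fragments:
        raise ValueError("empty SMILES string")
    return max(fragments, key=lambda frag: sum(ch.isalpha() for ch in frag))


# Cetylpyridinium chloride written as a salt: cation and chloride counterion.
cpc_salt = "CCCCCCCCCCCCCCCC[n+]1ccccc1.[Cl-]"
print(strip_counterions(cpc_salt))
```

A dedicated cheminformatics toolkit would handle edge cases (mixtures, multiply charged species) that this string-based heuristic does not.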

Data preprocessing
After the descriptors were calculated, all data were normalized with the Normalize Data (v.1.0) tool developed by the Roy lab [31]. This is a Java program that requires a .csv file of the descriptors. The model building set was then split into a test set (15%) and a training set (85%) via the Data set Division GUI (v.1.2), also developed by the Roy lab [32,33] (http://teqip.jdvu.ac.in/QSAR_Tools/#ADInHouse).
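
As a rough illustration of this preprocessing, the sketch below scales each descriptor column and divides the compounds 85/15. Both choices are assumptions: the scaling used by the Normalize Data tool and the division algorithm used by the Data set Division GUI are not specified in this section, so min-max scaling and a seeded random split stand in for them. The function names are hypothetical.

```python
import random


def min_max_normalize(rows):
    """Scale each descriptor column to [0, 1]; constant columns become 0.0."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]


def split_train_test(n_compounds, test_fraction=0.15, seed=0):
    """Return (train_indices, test_indices) from a seeded random shuffle."""
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)
    n_test = round(n_compounds * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


# Toy descriptor matrix: three compounds, two descriptors (second is constant).
descriptors = [[1.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
print(min_max_normalize(descriptors))

# 47 compounds, the size of the model building set reported in the Discussion.
train, test = split_train_test(47)
print(len(train), len(test))  # 40 7
```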

QSARINS model calculation
QSARINS (http://www.qsar.it/), an open source QSAR modeling software utilizing multiple linear regression (MLR), was used to create the QSAR model and to generate each prediction [34,35]. First, the model building set was altered to fit the QSARINS format. The MIC was then added to the descriptors column, and the test and training sets were combined into a single file in which each compound was given a numerical identifier (1 for training set, 2 for test set) in the last column. This was saved as a .txt file. The software was run according to the protocol listed in the manual. We used its internal filters to remove all descriptors that were constant across more than 80% of the data set or that were more than 95% correlated with another descriptor. The genetic algorithm was run for combinations of up to 130 descriptors, ranked by the leave-one-out cross-validated Q² (Q²(LOO)). In total, 840 models were created and assessed using the validation statistics available in QSARINS. An arbitrary cutoff of R² > 0.75, R² - Q² < 0.10 (for both leave-one-out and leave-many-out), and |Q² - Y-scramble| > 0.50 was used. Twelve models were left for further validation. Predictions for the prediction set and the validation set were performed using the built-in tool.
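
The descriptor pre-filter can be sketched as follows. This is an illustration under stated assumptions, not the QSARINS implementation: "constant" is taken to mean that a single value accounts for more than 80% of the compounds, and for each pair with |r| > 0.95 the descriptor seen first is the one kept. The descriptor names in the example are made up.

```python
from collections import Counter
from math import sqrt


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def filter_descriptors(columns, max_constant=0.80, max_corr=0.95):
    """columns maps descriptor name -> list of values (one per compound)."""
    kept = []
    for name, values in columns.items():
        most_common_count = Counter(values).most_common(1)[0][1]
        if most_common_count / len(values) > max_constant:
            continue  # (semi-)constant descriptor
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # redundant with a descriptor that was already kept
        kept.append(name)
    return kept


# Hypothetical descriptors for six compounds.
columns = {
    "chain_len": [8, 10, 12, 14, 16, 18],
    "chain_len_x2": [16, 20, 24, 28, 32, 36],  # perfectly correlated with chain_len
    "has_ring": [0, 0, 0, 0, 0, 1],            # same value for 5 of 6 compounds
    "logP": [2.1, 3.5, 1.0, 4.2, 2.8, 3.9],
}
print(filter_descriptors(columns))  # ['chain_len', 'logP']
```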

Model validation
In order to find a new chemical to treat meat surfaces, we performed a literature search for current QACs and their respective MIC against E. coli [27][28][29]. The compounds that we found had at least one cationic nitrogen and a carbon chain.
Other commonly identified structural features include nitrogen, oxygen, benzene rings, and even barium in one compound. Activities of these compounds range from an MIC of 1.88 μg/ml to 12800 μg/ml. Using all available literature data on the antimicrobial activity of currently available QACs against E. coli, represented by the log of the MIC, we developed 840 potential models using the QSARINS software. QSARINS systematically uses optimized descriptors to build models, starting at one descriptor and building more complex models using a genetic algorithm (GA). The GA organizes the descriptors into genes in a chromosome, and other descriptors are then substituted into this chromosome. This continues with a constant mutation rate for 500 generations. At the end of these generations each chromosome is used to create an MLR-based QSAR model. The top five models (determined by Q²(LOO)) are kept for each iteration. The number of descriptors is increased as the calculation progresses. Due to computing limitations, this process was stopped at 130 descriptors, although most optimal models had fewer than eight descriptors. The top models had some descriptors in common, or at least very similar fragments. The H-C-O structure fragment was seen in 10 of the top 12 models. We organized these descriptors into four categories to explain the importance of certain types of descriptors for this model calculation: (1) short fragments (specific fragments of five atoms or fewer), (2) long fragments (specific fragments of more than five atoms), (3) non-specific fragments (fragments with general patterns and not specific structural identities, for example C*C*N:(Fragmentor), in which "*" could be any atom), and (4) the log of the lipophilicity, which was calculated by ALogPS (Table 1).
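
To make the ranking criterion concrete, the following is a minimal sketch of a leave-one-out Q² for an MLR model, using the usual definition Q²(LOO) = 1 - PRESS/TSS. It is written in plain Python for illustration and is not taken from QSARINS. The helper names and the toy data are assumptions.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: descriptors are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_mlr(x_rows, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(row) for row in x_rows]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Intercept plus the weighted sum of descriptor values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def q2_loo(x_rows, y):
    """Refit the model once per left-out compound and accumulate PRESS."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coefs = fit_mlr(x_rows[:i] + x_rows[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coefs, x_rows[i])) ** 2
    tss = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - press / tss


# Toy data: one descriptor and a nearly linear response (illustrative values).
x = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(q2_loo(x, y), 3))
```

In a GA-driven search, a function like `q2_loo` would serve as the fitness score for each candidate descriptor subset.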
In order to select the best potential models from the 840 potential models, a general filter of R² > 0.75, R² - Q² < 0.10, and |Q² - Y-scramble| > 0.50 was used to reduce the list to 12 potential models based on internal validation calculations done with the QSARINS software (Table 2). The majority of models were removed by the R² - Q² filter. An external testing dataset was then predicted by each model in order to perform an external validation. For this study we focused on the general prediction ranking (R²) and the specific accuracy of our prediction (percent error). These were calculated and are displayed in Table 3. It is typical in the QSAR community to rely more on the general predictive ranking than on accuracy alone, as these predictions will be used for filtering a larger list for experimental validation rather than for direct prediction [36]. Many of the models were very similar in their validations; therefore the most optimal model, 81, was selected to provide an example of the internal and external regressions (Figures 1 and 2).
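
The internal-validation filter amounts to a simple predicate over each model's statistics. The sketch below assumes the statistics have been exported into dictionaries. The key names and the example values are hypothetical and are not taken from Table 2.

```python
def passes_cutoffs(stats):
    """Apply the R2, R2 - Q2 (LOO and LMO), and Y-scramble cutoffs to one model."""
    return (
        stats["r2"] > 0.75
        and stats["r2"] - stats["q2_loo"] < 0.10
        and stats["r2"] - stats["q2_lmo"] < 0.10
        and abs(stats["q2_loo"] - stats["q2_yscramble"]) > 0.50
    )


# Illustrative statistics for three hypothetical candidate models.
candidates = {
    "model_A": {"r2": 0.92, "q2_loo": 0.88, "q2_lmo": 0.86, "q2_yscramble": -0.30},
    "model_B": {"r2": 0.95, "q2_loo": 0.70, "q2_lmo": 0.68, "q2_yscramble": -0.25},
    "model_C": {"r2": 0.60, "q2_loo": 0.55, "q2_lmo": 0.54, "q2_yscramble": -0.20},
}
selected = [name for name, stats in candidates.items() if passes_cutoffs(stats)]
print(selected)  # ['model_A']
```

Here `model_B` fails because its R² exceeds its Q² by more than 0.10, the pattern that removed most models in this study, and `model_C` fails the R² cutoff.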
Many studies have pointed to the effectiveness of using a consensus model, rather than a single model, for increasing the accuracy of predictions for unknown compounds [36,37]. Using the twelve previously identified models, we averaged the predictions on the validation set to develop three different consensus models (Table 2 and Figure 3). The first was created from all available models. The second was made from selected models that had an R² > 0.9 and an average error < 20%. The third was formed by removing model 65, which had the worst external validation, with an R² of 0.19 and an average error of 72%. These consensus models generally had lower error and higher R² than the single models. Removing the lesser models, or the single worst model, did not improve the accuracy of the consensus. From the validation data we determined that the consensus model made from all the available models would be the preliminary optimized model to use for predictions of unknown compounds.

[Figure: Flow chart of the QSAR building process, with software used at each step.]
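
The consensus step can be sketched as an equal-weight average of the per-model predictions, followed by the two external-validation measures. Two assumptions are made here: R² is taken as the squared Pearson correlation between observed and predicted values, and percent error as the mean absolute relative error. The exact formulas behind Table 3 may differ. All numbers are illustrative.

```python
def consensus_prediction(predictions_by_model):
    """Average per-compound predictions across models (equal weights)."""
    per_compound = zip(*predictions_by_model.values())
    return [sum(values) / len(values) for values in per_compound]


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def mean_percent_error(observed, predicted):
    """Mean absolute error relative to the observed value, in percent."""
    errors = [abs(p - o) / abs(o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical observed log MIC values and predictions from two models
# whose errors point in opposite directions.
observed = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "model_1": [1.2, 2.2, 3.2, 4.2],
    "model_2": [0.8, 1.8, 2.8, 3.8],
}
consensus = consensus_prediction(predictions)
print(mean_percent_error(observed, predictions["model_1"]))
print(mean_percent_error(observed, consensus))
```

When individual model errors partly cancel, as in this toy case, the averaged prediction has a lower error than either model alone.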

Predictions for unknown compounds
The purpose of creating a QSAR model is to apply it to previously unstudied compounds with unknown biological activities. We collected a list of 1000 compounds from PubChem that had substructures similar to CPC [38]. Using the consensus model, we identified the top 10 compounds in terms of predicted MIC against E. coli. Compounds that were in the applicability domain for at least 75% of the models within the consensus were included in the final list. This left us with 39 compounds. These compounds, their structures, and their predicted activities are shown in Table 4.
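
The selection logic described above can be sketched as a filter on applicability-domain coverage followed by a ranking on the consensus prediction. The per-model in-domain flags are assumed to have been exported from the modeling software. How they are computed is outside this sketch, and all identifiers and values are hypothetical.

```python
def rank_candidates(predictions, in_domain, min_fraction=0.75, top_n=10):
    """Rank compounds by mean predicted log MIC, lowest (most active) first.

    predictions: compound id -> list of predicted log MIC values, one per model
    in_domain:   compound id -> list of booleans, one per model
    """
    ranked = []
    for compound, values in predictions.items():
        flags = in_domain[compound]
        if sum(flags) / len(flags) < min_fraction:
            continue  # outside the applicability domain of too many models
        ranked.append((sum(values) / len(values), compound))
    ranked.sort()
    return [(compound, score) for score, compound in ranked[:top_n]]


# Three hypothetical compounds scored by four models.
predictions = {
    "cmpd_a": [1.0, 2.0, 1.5, 1.5],
    "cmpd_b": [0.5, 0.5, 0.5, 0.5],
    "cmpd_c": [1.0, 1.0, 1.0, 1.0],
}
in_domain = {
    "cmpd_a": [True, True, True, False],   # 75% of models: kept
    "cmpd_b": [True, True, False, False],  # 50% of models: dropped
    "cmpd_c": [True, True, True, True],
}
print(rank_candidates(predictions, in_domain))
# [('cmpd_c', 1.0), ('cmpd_a', 1.5)]
```

Note that `cmpd_b` has the lowest predicted MIC but is excluded, since a prediction outside the applicability domain is an extrapolation.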

Discussion
Using literature values, a QSAR model was developed in order to predict the MIC of potential compounds that could be used to combat bacteria on the surface of food during processing. Our model was based on 47 compounds with recorded literature MIC values against E. coli, collected across three different studies to increase the variation of structures and MIC values. Using the built-in GA, the best descriptors and the optimal number of descriptors were selected to avoid overtraining of the model. Some may argue that using only up to 130 descriptors could be a detriment to our study, but in any calculation done with more than 15 variables there was a significant decrease in Q², leading us to believe that overtraining had occurred beyond that point.
Now that we have a viable QSAR model of MIC and preliminary predictions for almost 900 structures, we plan to experimentally validate the predicted MIC. After this validation our lab will focus on creating two more models: (1) one to predict the environmental degradation of these compounds, and (2) one to predict the amount of residue that would be left on different food products when the compounds are used for sterilization. These steps will help us to discover a safer compound from the list of potential compounds.
Disinfectants in the food industry are incredibly important for the reduction of spoilage-causing bacteria as well as those that can cause disease. Unfortunately, current techniques have many issues. One compound that is efficient in both cost and antibacterial action is CPC, but the remaining residue must be removed or the products could become toxic. In order to find a comparable compound without the toxic residue, our lab developed a QSAR model that could predict the antimicrobial activity of potential compounds before experimental testing. This model will allow us and other labs to save money and time by specifically testing compounds that have predicted efficacy for antimicrobial behavior. By developing and testing new antimicrobial QACs we hope not only to reduce the bacteria on the surface of food in a safe manner, but also to reduce the amount of antimicrobial damage to the local environment. With the addition of new QACs we also expect to help combat the rise in antibiotic/antimicrobial resistant bacteria.

[Figure: Predictions on an external validation, with regressions of three different consensus models. Green: all available models that met our cutoffs. Blue: all models that had an external validation R² > 0.90 and an average standard error < 0.20. Yellow: a regression with the single worst model (having an R² of 0.19). Each R² is displayed under the legend heading for each data set.]