Introduction

DNAproDB is a web server and database that assists researchers in performing structural analysis of DNA-protein complexes, and specifically helps the user understand how double-stranded DNA and proteins interact. Many excellent programs and algorithms have been developed for analyzing various structural aspects of biological macromolecules. However, developing an automated workflow for processing structural data is difficult and time-consuming. To get a full picture of the structural features and biophysical interactions that are most often of interest when studying DNA-protein complexes, one usually has to employ a suite of software packages. Each package specializes in a particular aspect of biomolecular structure, has a unique interface, and may have special format requirements, quirks, and pitfalls. Further, structures vary widely in their complexity and quality, and there are many edge cases to consider when performing structural analysis on large datasets of DNA-protein complexes.

Once the data has been generated, one has to parse what has been extracted from the structures and organize this very high-dimensional data in a consistent way that is useful for the desired analysis. Finally, as a crucial component of the user's analysis, one should visualize the data in a clear and insightful way. The entire process adds up to a highly non-trivial task. DNAproDB was designed to address all of these concerns and to automate as much of the process as possible, so that the user can spend the most time actually analyzing data and the least time generating it.

DNAproDB has three major components. The first is the processing pipeline, which takes as input structure files (in mmCIF or PDB format) and extracts useful biophysical information from them, using a variety of third-party libraries and software, as well as some of our own code. This information is then combined in a hierarchical way and stored in a JSON file, which is ideal for easy parsing and integration with the web.

The second component of DNAproDB is our visualization tools, which are found in the report page for a structure (here a structure always refers to the structure of a protein bound to double-stranded DNA). These tools allow you to visualize various DNA-protein interactions side-by-side with the three-dimensional view of the complex. Clicking and hovering on different parts of the visualizations will display additional information, or highlight parts of the three-dimensional structure, and allow the user to visually explore different layers of the DNA-protein interface. These visualizations can be customized and exported as a PNG file.

Finally, the third component of DNAproDB is our database. We have pre-processed a large number of DNA-protein complexes retrieved from the Protein Data Bank (PDB), which we store in a database (hence the "DB" in "DNAproDB"). The database is designed as a document-oriented database and is implemented in MongoDB. All data retrieved from DNAproDB (whether it is an uploaded structure or an entry in our database) is returned in the form of a JSON file. JSON is a popular text-based data interchange standard, similar in some ways to XML, but in general much easier to use.

On the remainder of this page you'll find information about how to use the various aspects of DNAproDB. Have a scroll through, or use the side-navigation to the left to jump to a specific section.

Using Data from DNAproDB

Every page on this website which displays information about a DNA-protein complex generates its content from the data stored in a JSON file corresponding to that complex. The JSON files are self-contained; that is, a single JSON file describes all the information about a particular DNA-protein complex in one place. This includes a small set of information which is retrieved from external databases, such as UniProt and the PDB, and which is not available directly in the mmCIF files. This external information is not included for uploaded structures, since these are assumed to be unpublished; if an uploaded structure has been published, search the database for it instead. Therefore, while using this website is very easy and convenient, the user also has the option to download our data and do whatever they would like with it (all we ask is a citation).

Data Format

Below is the basic schema for the way we organize the data in DNAproDB. In general, every DNA-protein complex which is stored in DNAproDB is represented by a JSON file with the same basic structure. However, many fields are of variable length, which depends on the size and complexity of the structure. Use the interactive diagram below to explore the structure of the data file and read the description of each field. Note that the field "description" is only used here for documentation purposes, and does not appear in actual data files retrieved from DNAproDB.


As a more concrete example, below is a JSON file which is returned from DNAproDB for the small structure 1jgg. Click on the different fields to expand them and explore the data structure.

Note that when downloading data for multiple structures from the DNAproDB database, the data will be returned in a single file, which contains an array of objects. Each object corresponds to a single structure.
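Because a multi-structure download is an array while a single-structure download is one object, a loader can normalize both cases to a list. The following is a minimal sketch only; the helper name load_structures and the file name are illustrative and are not part of DNAproDB.

```python
import json

def load_structures(path):
    """Return a list of structure objects from a DNAproDB JSON download.

    A single-structure file holds one JSON object; a multi-structure
    download holds an array of objects. Both are returned as a list.
    """
    with open(path) as handle:
        data = json.load(handle)
    # Wrap a lone object so callers can always iterate over the result.
    return data if isinstance(data, list) else [data]
```

With this helper, code such as `for structure in load_structures("data_file.json"): ...` works unchanged for either kind of download.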

Parsing A Data File

In addition to extracting information from a data file in-browser via the report page, one can download a JSON file which contains all the information about a DNA-protein complex stored in DNAproDB, and parse that file offline. In general, all you need is a programming language which supports a JSON parser, and the parser itself. A long list of JSON parsers is available at the official JSON web page. Below is a simple example in Python. We'll extract all the hydrogen bond information from the interactions.nucleotide-residue_interactions.hbonds fields, and count the number of hydrogen bonds that involve an arginine in the minor groove.

Python Example

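import json

# Load the DNAproDB JSON file for a single structure.
with open("data_file.json") as data_file:
    dnaprodb_data = json.load(data_file)

# Count hydrogen bonds between arginine residues and the minor groove ("sg").
count = 0
for interaction in dnaprodb_data["interactions"]["nucleotide-residue_interactions"]:
    # Skip interactions whose "hbonds" field is None.
    if interaction["hbonds"] is not None:
        for hbond in interaction["hbonds"]:
            if hbond["res_name"] == "ARG" and hbond["groove"] == "sg":
                count += 1
print("Arginine - Minor Groove Hbonds: %d" % count)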
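
The same loop generalizes to a tally over every residue type and groove label. This is a minimal sketch, assuming the same field names used in the example above (hbonds, res_name, groove); the function name tally_hbonds is hypothetical.

```python
from collections import Counter

def tally_hbonds(dnaprodb_data):
    """Count hydrogen bonds per (residue name, groove label) pair."""
    tally = Counter()
    interactions = dnaprodb_data["interactions"]["nucleotide-residue_interactions"]
    for interaction in interactions:
        # Treat a missing or None "hbonds" field as an empty list.
        for hbond in interaction.get("hbonds") or []:
            tally[(hbond["res_name"], hbond["groove"])] += 1
    return tally
```

Looking up `tally_hbonds(dnaprodb_data)[("ARG", "sg")]` then gives the same count as the example above, and a pair that never occurs simply returns 0.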
                        

Searching DNAproDB

DNAproDB's database has powerful search capabilities that allow users to search for structures based on characteristics of the DNA, the protein, or the DNA-protein interface. Multiple criteria can be combined using different selection logic. In addition, you can search for structures by their PDB identifier, or supply a list of PDB identifiers in combination with additional criteria, which act as a filter. This search capability is unique to DNAproDB, and can be used to generate data sets or to find structures whose PDB identifier you do not know. The easiest way to search the database is through the search page, where you can construct a query using a regular HTML form. The database can also be searched using a RESTful API, but this requires familiarity with the MongoDB query syntax.
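
For readers unfamiliar with MongoDB query syntax, a query is itself a JSON document in which operators such as $and, $in, $gte, and $lte express the selection logic. The sketch below builds one in Python. The field names dna.sequence_length and dna.conformation are placeholders invented for illustration and are not taken from the DNAproDB schema, and no API endpoint is assumed.

```python
import json

# Placeholder field names, not taken from the DNAproDB schema.
query = {
    "$and": [
        # Numeric range: 10 <= value <= 20.
        {"dna.sequence_length": {"$gte": 10, "$lte": 20}},
        # Membership: value is any one of the listed options.
        {"dna.conformation": {"$in": ["B-DNA", "A-DNA"]}},
    ]
}

# Serialize the query document to JSON text.
query_text = json.dumps(query)
print(query_text)
```

The operators are standard MongoDB; only the field names would need to be replaced with those from the actual data format.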

The Search Form

The search form can be accessed from the search page. When first presented, the form will be empty and contain no fields. Choose one of the criteria, and you'll be presented with fields you can fill in which correspond to the different aspects of a DNA-protein complex (protein properties, DNA properties, or DNA-protein interaction properties).

Choosing one of the criteria will update the form with a panel which shows the fields corresponding to that criterion. Additionally, you may add more criteria or clear all of them with the "add another criteria" and "clear all criteria" buttons, remove an individual criterion using the "remove" button, and choose how different criteria are combined logically.

Once a criterion is chosen, any of its fields which are left blank or unchecked will be ignored, so you only have to fill in the items you care about. Currently, a matching structure must match all the defined fields within a specific criterion. For example, if you choose the DNA properties, specify that the sequence length should be between 10 and 20 base-pairs, and also check that the DNA Conformation should be B-form, then only structures which meet BOTH requirements will be returned. However, for fields which have multiple options, such as B-DNA, A-DNA, or Other under DNA Conformation, choosing multiple options will match structures which meet ANY of the selected options. So, if you set the sequence length to be between 10 and 20 base-pairs, and check both B-DNA and A-DNA under DNA Conformation, the search will return structures in which the DNA is between 10 and 20 base-pairs in length AND is either B-DNA-like or A-DNA-like.
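
The matching rule described above can be sketched as a small predicate: filled-in fields within a criterion are combined with AND, the options selected within a multi-option field are combined with OR, and blank fields are skipped. This illustrates the logic only and is not DNAproDB's implementation; the dictionary keys sequence_length and conformation, and the sample records, are hypothetical.

```python
def matches(structure, length_range=None, conformations=None):
    """Return True if a structure satisfies every field that was filled in."""
    # A blank field (None or empty) is ignored.
    if length_range is not None:
        low, high = length_range
        if not (low <= structure["sequence_length"] <= high):
            return False
    if conformations:
        # Several checked options: matching ANY one of them is enough.
        if structure["conformation"] not in conformations:
            return False
    return True

# Made-up records for illustration only.
structures = [
    {"id": "a", "sequence_length": 12, "conformation": "B-DNA"},
    {"id": "b", "sequence_length": 15, "conformation": "A-DNA"},
    {"id": "c", "sequence_length": 15, "conformation": "Other"},
    {"id": "d", "sequence_length": 30, "conformation": "B-DNA"},
]
hits = [s["id"] for s in structures
        if matches(s, length_range=(10, 20), conformations={"B-DNA", "A-DNA"})]
print(hits)  # ['a', 'b']
```

Record "c" is rejected because its conformation is not among the checked options, and "d" because its length falls outside the range, even though each satisfies the other field.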