Introduction

DNAproDB is a processing pipeline, web server, and database that aims to assist researchers in performing structural analysis of DNA-protein complexes. The DNAproDB processing pipeline takes as input the three-dimensional structure of a DNA-protein complex (in mmCIF or PDB format) and extracts useful bio-physical information from it, using a combination of our own tools and well-known third-party libraries and software. Information is organized into three main categories: DNA features which describe the sequence and structure of the DNA present in the complex, protein features which describe the sequence and structure of the protein in the complex, and interface features that describe the various DNA-protein interactions and statistics of those interactions present in the complex. This information is then combined in a hierarchical way and stored in a JSON file, which is ideal for easy parsing and integration with the web.

DNAproDB provides tools for exploring and visualizing the data extracted from a complex. These tools are found on the report page for each structure, which can be accessed by searching the database, typing the PDB identifier in the search box at the top of any page on the website, or following the link generated when uploading a structure. These tools allow you to visualize various DNA-protein interactions side-by-side with a three-dimensional view of the complex. Clicking and hovering on different parts of the visualizations will display additional information or highlight parts of the three-dimensional structure and allow the user to visually explore different layers of the DNA-protein interface. These visualizations can be customized and exported as a PNG file.

DNAproDB is also a database. We have pre-processed a large number of DNA-protein complexes retrieved from the Protein Data Bank (PDB) which we store in a database - hence the "DB" in "DNAproDB". The database is designed as a document-oriented database and is implemented in MongoDB. The data stored in the DNAproDB database is identical to the data produced by the processing pipeline and the data available in the report pages. It is returned in the form of a JSON file.

On the remainder of this page, you'll find information about how to use the various aspects of DNAproDB. Have a scroll through or use the side-navigation to the left to jump to a specific section.

Using Data from DNAproDB

Any page that displays information to the user about a DNA-protein complex is generated from the data stored in a DNAproDB data file (JSON format) corresponding to that complex. The data files are self-contained: a single data file describes all the information about a particular DNA-protein complex structure in one place. For structures that come from the PDB, this includes a small amount of information that is retrieved from external databases such as UniProt, JASPAR, and the Gene Ontology knowledgebase. This information is not included for uploaded structures, since these are assumed to be unpublished; if an uploaded structure has been published, search the database for it instead. Users can download our data for their own purposes offline (all we ask is a citation). You can download the entire database as a flat file from the download page, download data for a set of structures returned from a query by searching the database, or visit the report page for an individual structure and download the data there.

Data Format

In order to understand the way DNAproDB data is organized, it is important to define some key terminology that is used throughout the site and to describe the conceptual framework DNAproDB is built on. Below is a diagram representing the conceptual framework of how we organize structural data in DNAproDB, which is based on the Protein Data Bank's framework.

Below is the basic schema for the way we organize the data in DNAproDB. In general, every DNA-protein complex that is stored in DNAproDB is represented by a JSON file with the same basic structure. However, some fields are of variable length, which depends on the size and complexity of the structure. Use the interactive diagram below to explore the structure of the data file.

As a more concrete example, below is a JSON file that is returned from DNAproDB for the small structure 1jgg. Click on the different fields to expand them and explore the data structure.

We use various abbreviations in feature names throughout the DNAproDB data files. Reference the table below for the meaning of these abbreviations.

abbreviation	full name	description
nuc	nucleotide	A DNA nucleotide (nucleic acid).
res	residue	A protein residue (amino acid).
pp	phosphate	A DNA backbone structural moiety that consists of the phosphate group.
sr	sugar	A DNA backbone structural moiety that consists of the pentose sugar group.
wg	major groove or "wide groove"	A helical DNA base structural moiety that consists of the base edge that is exposed in the major groove.
sg	minor groove or "small groove"	A helical DNA base structural moiety which consists of the base edge that is exposed in the minor groove.
bs	base	A DNA structural moiety which consists of the entire nucleoside base.
mc	main chain	A protein structural moiety which consists of the residue backbone.
sc	side chain	A protein structural moiety that consists of the residue side chain or functional group.
H	helix	Simplified protein secondary structure consisting of 'H', 'G', or 'I' helices using DSSP notation.
S	beta strand/beta sheet	Simplified protein secondary structure consisting of 'E' using DSSP notation.
L	loop	Simplified protein secondary structure consisting of 'B', 'T', 'S', and '-' using DSSP notation. Anything that is not a helix or a strand.
sse	secondary structure element	A continuous sequence of residues that share a helix or strand conformation.
basa	buried solvent accessible surface area	The amount of solvent accessible surface area that is lost when two components of the complex are bound to each other.
fasa	free solvent accessible surface area	The solvent accessible surface area when the component is in a "free" state (e.g. the SASA of the DNA when the protein is removed from the structure).
sesa	solvent excluded surface area	The solvent excluded surface area of a component.
hbond	hydrogen bond	A favorable electrostatic interaction between a polar hydrogen (donor) and an electronegative atom (acceptor).
vdw	van der Waals	Two atoms which are in close proximity and may form induced dipole interactions.
cv	circular variance	A measure of the protein surface geometry.
wa	water atom	A water atom present in the structure which is usually involved in water-mediated hydrogen bonding.
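
The simplified secondary structure classes in the table (H, S, L) can be reproduced from DSSP one-letter codes with a small lookup. The following is a minimal sketch of that mapping as the table describes it, not code taken from the DNAproDB pipeline; the function names `simplify_dssp` and `simplify_dssp_string` are hypothetical.

```python
# Simplified secondary structure classes, following the abbreviation table:
#   H = helix  (DSSP 'H', 'G', 'I')
#   S = strand (DSSP 'E')
#   L = loop   (anything else, e.g. 'B', 'T', 'S', '-')
DSSP_TO_SIMPLE = {"H": "H", "G": "H", "I": "H", "E": "S"}


def simplify_dssp(code):
    """Map a one-letter DSSP code to 'H', 'S', or 'L' (hypothetical helper)."""
    return DSSP_TO_SIMPLE.get(code, "L")


def simplify_dssp_string(dssp):
    """Simplify a per-residue DSSP string one character at a time."""
    return "".join(simplify_dssp(code) for code in dssp)


print(simplify_dssp_string("-HHHHTTEEEESGGG-B"))  # LHHHHLLSSSSLHHHLL
```

Note that the DSSP code 'S' (bend) maps to the loop class 'L', while the simplified class 'S' stands for strand.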

Parsing A Data File

In addition to exploring a data file in-browser via the report page, one can download DNAproDB data files, which contain all the information about a DNA-protein complex provided by DNAproDB, and parse those files offline. In general, all you need is a programming language with a JSON parser. A long list of JSON parsers is available at the official JSON web-page.

Below are some simple examples in Python. Note that neither JSON nor the structure of the DNAproDB data files is specific to Python - the user can work in whatever language they prefer. When downloading data for multiple structures from the DNAproDB database, the data will be returned in a single file, which contains an array of objects. Each object corresponds to a single structure.
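
Because a multi-structure download is an array of objects while a single-structure file is one object, it can be convenient to normalize both cases to a list before processing. The sketch below shows one way to do that. The helper names are hypothetical, and the use of `structure_id` as the identifying field is an assumption based on the projection example in the Search API section; the second identifier in the demo is a placeholder, not a real entry.

```python
import json


def as_structure_list(parsed):
    """Return a list of structure objects for either file layout (hypothetical helper)."""
    return parsed if isinstance(parsed, list) else [parsed]


def load_structures(path):
    """Load a DNAproDB data file and always return a list of structures."""
    with open(path) as handle:
        return as_structure_list(json.load(handle))


# Demo with inline JSON instead of a downloaded file.
# "structure_id" is assumed to be the identifier field; "placeholder" is made up.
multi = json.loads('[{"structure_id": "1jgg"}, {"structure_id": "placeholder"}]')
single = json.loads('{"structure_id": "1jgg"}')

for structure in as_structure_list(multi):
    print(structure["structure_id"])

print(len(as_structure_list(single)))  # 1
```

With this normalization, the per-structure loops in the examples that follow can be wrapped in a single `for structure in load_structures(path):` loop.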

Python 2 example - counting hydrogen bonds in the DNA major groove

In this example, we'll count the number of hydrogen bonds made by arginine residues in the DNA major groove, assuming our data file contains some helical DNA, and print the result to the console.


import json
with open("data_file.json") as DATA:
    dnaprodb_data = json.load(DATA)                                  # load the data file using the json module

hb_count = 0                                                         # store the number of hydrogen bonds which meet our criteria
model_num = 0                                                        # choose the first model in the structure
for interface in dnaprodb_data["interfaces"]["models"][model_num]:   # iterate over each DNA-protein interface for model 'model_num'
    for interaction in interface["nucleotide-residue_interactions"]: # iterate over each nucleotide-residue interaction
        if(
            interaction["res_name"] == "ARG"                         # check if the residue in the interaction is an arginine
            and 
            interaction["nuc_secondary_structure"] == "helical"      # check if the nucleotide is in a helical conformation
        ):                        
            hb_count += interaction["hbond_sum"]["wg"]["sc"]         # if so, add the number of major groove-side chain hbonds to count
print("Arginine - Major Groove Hbonds: %d" % hb_count)               # print the result

Python 2 example - identifying interactions in regions of narrow minor groove width

In this example, we'll find all nucleotide-residue interactions that occur in the DNA minor groove and where the minor groove width is less than 5 Å wide. We'll choose the first model in the structure (model 0), and loop over all DNA entities, and all helical segments within each entity, and record the nucleotide identifiers that correspond to narrow minor groove widths. We then loop over all nucleotide-residue interactions and find which interactions occur with a nucleotide in a narrow minor groove region.


import json
with open("data_file.json") as DATA:
    dnaprodb_data = json.load(DATA)                                            # load the data file using the json module


model_num = 0                                                                  # choose the first model in the structure
narrow_minor_groove_nucleotides = []                                           # an array to store nucleotide identifiers in narrow minor groove regions
for entity in dnaprodb_data["dna"]["models"][model_num]["entities"]:           # loop over DNA entities in model model_num
    for helix in entity["helical_segments"]:                                   # loop over helices for each entity
        for i in range(helix["length"]):                                       # loop over positions in the helix ('range' works in Python 2 and 3)
            if(helix["shape_parameters"]["minor_groove_curves"][i] < 5.0):     # check if minor groove at position 'i' is less than 5.0
                narrow_minor_groove_nucleotides.append(helix["ids1"][i])       # if so push nucleotide id at position 'i' from strand 1
                narrow_minor_groove_nucleotides.append(helix["ids2"][i])       # and from strand 2

narrow_minor_groove_interactions = []                                          # array to store interactions meeting our criteria
for interface in dnaprodb_data["interfaces"]["models"][model_num]:             # iterate over each DNA-protein interface for model model_num
    for interaction in interface["nucleotide-residue_interactions"]:           # iterate over each nucleotide-residue interaction
        if("sg" in interaction["nucleotide_interaction_moieties"]):            # check if this interaction involves the minor groove
            if(interaction["nuc_id"] in narrow_minor_groove_nucleotides):      # check if the nucleotide is in our list of narrow minor groove nucleotides
                narrow_minor_groove_interactions.append(interaction)           # if the above conditions are met, push the interaction

# now the array 'narrow_minor_groove_interactions' contains all minor groove 
# interactions which involve nucleotides in narrow minor groove regions.
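
As a possible follow-up, the collected interactions can be summarized by residue type, for example to see whether arginine or lysine residues dominate the narrow minor groove contacts. This is a minimal sketch using only the `res_name` field seen in the examples above; the toy interaction list and its `nuc_id` values are invented for illustration and do not reflect the real identifier format.

```python
from collections import Counter


def count_by_residue(interactions):
    """Count interactions per residue name (hypothetical helper)."""
    return Counter(interaction["res_name"] for interaction in interactions)


# Toy stand-in for 'narrow_minor_groove_interactions'; values are made up.
toy_interactions = [
    {"res_name": "ARG", "nuc_id": "nuc-1"},
    {"res_name": "LYS", "nuc_id": "nuc-2"},
    {"res_name": "ARG", "nuc_id": "nuc-3"},
]

counts = count_by_residue(toy_interactions)
for res_name, n in counts.most_common():
    print("%s: %d" % (res_name, n))
```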

Troubleshooting

Occasionally, you may experience errors when using DNAproDB. If errors occur on the frontend (i.e., not involving structure processing), we recommend trying incognito or private browsing mode for best results. Each DNAproDB page contains instructions for usage and requirements (e.g., upload, search). Please refer to these instructions if you experience errors. On the visualization page, instructions for plot usage can be found by hovering over the plot name.

DNAproDB is updated often, which can cause temporary errors even when the website itself is accessible. If you see errors, we recommend checking back in a day or so. You can also refer to the paper or access the backend GitHub to run structure processing locally. Lastly, you can contact the Rohs lab here.

Visualization

The Report Page

An example visualization can be viewed here. By default, you will see the 'Residue contact map' on the right. Hover over this text with your mouse to see additional information and instructions. Hover, click, and drag to interact with the graph.

On the left, you will see a 3D viewer of the structure that corresponds to the 'Residue contact map.' You can also click 'Data explorer' to see the JSON data stored in the DNAproDB database for a structure. To view alternative visualizations, including a 'Helical contact map' and 'Helical shape overlay,' click their corresponding tabs in the upper right. As before, hover over their text to see additional information.

Customization and Features

Click 'Chart options' on any visualization to show additional customization options and features. Some options may be available already above the plot. In the 'Residue contact map,' by default, the plot layout is generated using RNAscape which considers tertiary interactions. Click the 'Select mapping algorithm' button above the chart to change to a Secondary structure-based (ViennaRNA) or circular layout.

Click the 'Interface selection' button above a graph to select the DNA-protein interface considered for visualization. By default, only major groove, minor groove, and base interactions are shown. In this menu, one can select certain protein chains and DNA moieties (e.g., sugar, phosphate) for visualization. Click 'Show advanced options' to see more in-depth options, allowing you to exclude residues and specify interaction criteria. Lastly, click 'Visualize interface' to apply your changes.

Residue contact map

The Residue contact map shows individual nucleotide-residue interactions, DNA secondary structure, protein secondary structure and DNA interaction moieties. The DNA is displayed as a graph, with nucleotides being nodes and edges between them indicating backbone links, base pairing or base stacking. Different base pairing geometries are indicated via the base-pair edges, and other structural features such as backbone breaks, missing phosphates, and the DNA strand sense are represented. Protein residues are displayed as small nodes with the node shape and color representing residue secondary structure. Edges between residue and nucleotide nodes represent an interaction between the two and which DNA moiety(s) the interaction involves.

Use the Chart options button to modify labels, change the color scheme, modify the graph, and more. You can also:

Click to select individual residues or nucleotides and highlight them in the 3D view (hold SHIFT for multiple selections)
Double-click on labels to rename them
Drag components to reposition them
Scroll to zoom in or out on the plot
Hover above residues, nucleotides or interaction lines to display additional information (can be toggled off in chart controls)

Helical contact map

The Helical contact map plots interactions of protein secondary structure elements (SSEs) along the helical axis of a selected DNA helix. Interactions of SSEs with different DNA moieties are represented in concentric annuli, and the position of each plotted SSE is given in helicoidal coordinates, a curvilinear coordinate system defined by the axis of a helical DNA segment.

Use the Chart options button to modify labels, change the color scheme, modify the graph, and more. You can also:

Click to select individual residues or SSEs and highlight them in the 3D view (hold SHIFT for multiple selections)
Double-click on labels to rename them
Hover above an SSE to display additional information
Drag text labels to reposition them

Helical shape overlay

The Helical shape overlay plots DNA shape parameters (such as major and minor groove width, or helical parameters) for a selected helix along the sequence of the helix. In addition, DNA-protein residue interactions are plotted showing approximately where each residue in the interface interacts along the DNA sequence and the secondary structure of that residue. Protein residue interactions can be toggled off if only DNA shape parameters are desired, or they can be displayed to indicate possible DNA shape readout, such as the presence of positively charged residues in regions of narrow minor groove width.

Use the Chart options button at the top left to modify labels, change the color scheme, modify the graph, export it as a PNG file, and more. You can also:

Click to select individual residues and highlight them in the 3D view (hold SHIFT for multiple selections)
Double-click on labels to rename them
Hover above residues to display additional information
Drag text labels to reposition them

Searching DNAproDB

DNAproDB's database has powerful search capabilities that allow users to search for structures based on characteristics of the DNA, the protein, or the DNA-protein interface. Multiple criteria can be combined using different selection logic. In addition, you can search for structures with their PDB identifier or supply a list of PDB identifiers in combination with additional criteria, which act as a filter. This search capability is unique to DNAproDB and can be used to generate data sets or find structures based on structural and bio-physical features of the DNA, protein, or DNA-protein interactions present in the structure. The easiest way to search the database is through the search page, where you can construct a query using a simple expandable form.

Search Form

The search form can be accessed from the search page. When first presented, the form will be empty and contain no fields. Choose from one of the feature categories and you'll be presented with different fields you can fill in which correspond to the different aspects of a DNA-protein complex (protein properties, DNA properties, or DNA-protein interaction properties).

Choosing one of the feature categories will update the form with a panel that shows different fields corresponding to that category. Additionally, you may choose to add more categories, or clear all the categories with the "add another category" and "clear all categories" buttons, remove an individual category using the "remove" button, and choose how different categories are combined logically.

Once chosen, any fields within a category that are left blank or unchecked will be ignored, so you only have to choose items that you care about. You can choose to search for structures that match any selected feature within a category, or that match all selected features. For example, if you are matching on all selected features and choose DNA properties and specify that the sequence length should be between 10 and 20 base-pairs, and also check that the DNA Conformation should be B-form, then only structures that meet BOTH requirements will be returned. However, for fields which have multiple options such as B-DNA, A-DNA, or Other under DNA Conformation, choosing multiple options will match structures that meet ANY of the selected options. So, if you chose the sequence length to be between 10 and 20 base-pairs, and checked B-DNA and A-DNA under DNA conformation, then that will return structures in which the DNA is between 10 and 20 base-pairs in length AND is either B-DNA or A-DNA like.
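
The matching logic described above can be illustrated with a small local filter. This is only a sketch of the any/all semantics, not how the search form is implemented; the record fields (`length`, `conformation`) and the toy records are hypothetical.

```python
def length_10_to_20(record):
    """Criterion 1: sequence length between 10 and 20 base-pairs."""
    return 10 <= record["length"] <= 20


def conformation_b_or_a(record):
    """Criterion 2: a multi-option field, satisfied by ANY checked option."""
    return record["conformation"] in {"B-DNA", "A-DNA"}


def matches(record, criteria, mode="all"):
    """Combine criteria with 'all' (AND) or 'any' (OR) selection logic."""
    results = [criterion(record) for criterion in criteria]
    return all(results) if mode == "all" else any(results)


# Toy records with hypothetical field names.
records = [
    {"id": "a", "length": 12, "conformation": "B-DNA"},
    {"id": "b", "length": 12, "conformation": "Other"},
    {"id": "c", "length": 30, "conformation": "A-DNA"},
    {"id": "d", "length": 30, "conformation": "Other"},
]
criteria = [length_10_to_20, conformation_b_or_a]

all_hits = [r["id"] for r in records if matches(r, criteria, "all")]
any_hits = [r["id"] for r in records if matches(r, criteria, "any")]
print(all_hits)  # ['a']
print(any_hits)  # ['a', 'b', 'c']
```

Only record "a" satisfies both requirements, while "b" and "c" each satisfy one and are returned only when matching on any selected feature.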

Search API

The DNAproDB database can be searched using a web API, but doing so requires familiarity with both the MongoDB query syntax and the DNAproDB data file structure. DNAproDB is built using MongoDB version 3.4 (documentation for their query language is available here). We provide options to search the database by passing a JSON query string to our API with an optional projection string, or to retrieve data for an individual entry by its PDB identifier. We also provide some convenience options such as getting the total number of entries in the database. All data is returned in JSON - either a single document or an array of documents. Note that the use of the $where operator in queries is forbidden - any query using this operator will be rejected. See the table below for information about using our API.

URL	`/cgi-bin/request-data`
HTTP Method	`GET` | `POST`
URL Params	Querying:
	`query=[JSON string] (required)` example: `query={"protein.chains.organism": {"$regex": "homo sapiens", "$options": "i"}}`
	`projection=[JSON string] (optional)` example: `projection={"structure_id":1, "protein.chains": 1}`
	Retrieve a single entry:
	`pdbid_id=[four character PDB identifier]` example: `pdb_id=1jgg`
	Get the number of entries:
	`count=[boolean]` example: `count=true`
	Get last time the database was updated:
	`last-updated=[boolean]` example: `last-updated=true`
	Get list of all PDB identifiers:
	`pdbid-list=[boolean]` example: `pdbid-list=true`
Examples	Here we use jQuery to make a POST request using the 'query' parameter to find all entries with a human protein chain. We use the `$regex` operator to look for protein chains where the 'organism' feature matches "homo sapiens", case-insensitive.

	$.ajax({
	    url: "https://dnaprodb.usc.edu/cgi-bin/request-data",
	    dataType: "json",
	    data: {
	        'query': '{"protein.chains.organism": {"$regex": "homo sapiens", "$options": "i"}}',
	    },
	    type: "POST",
	    success: function(result) {
	        console.log(result);
	    }
	});

	Here we send a GET request with wget to retrieve the number of entries in the database.

	wget -O count.json "https://dnaprodb.usc.edu/cgi-bin/request-data?count=1"
	cat count.json
Notes	The use of the `$where` operator is not allowed. Queries using this operator will be automatically rejected. When using regular expressions, please use the `{"$regex": pattern, "$options": [options]}` syntax. If you need to use regular expressions inside of an `$in` expression, please convert it to a single regular expression instead. For example, rather than using `{"$in": [/pattern1/, /pattern2/, exact3]}`, convert this to `{"$regex": "pattern1|pattern2|^exact3$"}`.
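
For users working in Python rather than jQuery, the notes above can be sketched as two small helpers: one that merges the members of an `$in` list into a single `$regex` condition, and one that builds a GET request URL from a query and an optional projection. This is a minimal sketch assuming Python 3; the helper names are hypothetical, the base URL is taken from the examples in the table, and using `re.escape` to escape exact values is an assumption that Python's escaping is acceptable to the server's regular expression engine. No request is actually sent.

```python
import json
import re
from urllib.parse import urlencode

# Endpoint as shown in the examples in the table above.
BASE_URL = "https://dnaprodb.usc.edu/cgi-bin/request-data"


def in_to_regex(patterns, exact_values=()):
    """Merge patterns and exact strings into one $regex condition (hypothetical helper)."""
    anchored = ["^" + re.escape(value) + "$" for value in exact_values]
    return {"$regex": "|".join(list(patterns) + anchored)}


def build_query_url(query, projection=None, base_url=BASE_URL):
    """Build a GET URL carrying 'query' and optionally 'projection' as JSON strings."""
    query_string = json.dumps(query)
    if "$where" in query_string:
        # Client-side guard: the API rejects queries using $where.
        raise ValueError("the $where operator is not allowed")
    params = {"query": query_string}
    if projection is not None:
        params["projection"] = json.dumps(projection)
    return base_url + "?" + urlencode(params)


condition = in_to_regex(["pattern1", "pattern2"], ["exact3"])
print(condition)  # {'$regex': 'pattern1|pattern2|^exact3$'}

url = build_query_url(
    {"protein.chains.organism": {"$regex": "homo sapiens", "$options": "i"}},
    projection={"structure_id": 1, "protein.chains": 1},
)
print(url)
```

The resulting URL can be passed to any HTTP client; `urlencode` takes care of percent-encoding the JSON strings so they survive as URL parameters.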