GraphiaEnt:Input formats

From Kajeka Wiki

Graphia Enterprise is a powerful network analysis platform. To use it, data must be imported in a supported format. A wide range of standard and non-standard file formats are available for data input (see below).

Numerical Data Input (for correlation analyses)

Numerical Matrix (.csv, .tsv)

Data input format for generating correlation graphs within Graphia Enterprise. Files need to be saved as comma-separated (.csv) or tab-separated (.tsv) files.

Graphia Enterprise can create a graph structure by calculating a correlation matrix from a numerical data table. In this context, nodes represent entities, each described by a range of data variables that capture its behaviour, and edges represent correlations between entities. Data tables can be large, i.e. many thousands of rows and columns, and can be supplemented with other information (attribute data) relating to known properties of the entities represented in the table. The graphs generated are weighted, non-directional networks.

A correlation-ready data file consists of numerical values arranged in rows and columns (see figure). Each entity is represented by a row, the first column entry being the entity's (node's) ID. Following the entity ID column, there is the option to add node attribute data (light blue columns). This may be numerical or non-numerical information relating to the entities. Attributes are not used to generate the graph but can be used to understand its structure through visualisation, enrichment analysis etc. Ideally, the final attribute column should be non-numerical so as not to confuse the data parser. Similarly, information relating to the columns of data can be added below the column ID row. Following the attribute data (where available) is the data table that the program uses to generate the correlation matrix. Each cell within an entity row represents a measured variable for that entity. Numerical data columns should be arranged together to form a contiguous matrix, because correlation analysis is only performed on the contiguous numerical section. To aid interpretation of the resulting graph, it is highly desirable to order columns (and possibly rows) into logical groupings, i.e. based on their attributes.
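As an illustration of the layout described above, the following Python sketch writes a small correlation-ready table. The entity names, the "Class" attribute column and the sample column names are invented for the example; only the overall arrangement (ID column, then attribute columns, then a contiguous numerical block) is taken from the description.

```python
import csv
import io

# Hypothetical example data: entity ID, one non-numerical attribute, then measurements.
header = ["ID", "Class", "Sample1", "Sample2", "Sample3", "Sample4"]
rows = [
    ["GeneA", "kinase", 1.0, 2.0, 3.0, 4.0],
    ["GeneB", "kinase", 2.1, 3.9, 6.2, 8.0],
    ["GeneC", "receptor", 4.0, 3.0, 2.0, 1.0],
]

def write_table(stream):
    """Write the ID column, the attribute column, then the contiguous numerical block."""
    writer = csv.writer(stream)
    writer.writerow(header)
    writer.writerows(rows)

buffer = io.StringIO()
write_table(buffer)
table_text = buffer.getvalue()
print(table_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; passing an open file object to write_table instead would produce a .csv file on disk.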

Note that when loading data, an option is available to transpose the matrix, allowing the user to examine the similarity between columns rather than rows.
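The effect of transposing can be sketched in Python as follows. This page does not state which correlation measure Graphia Enterprise uses, so the Pearson coefficient here is an assumption made purely for illustration, and the small matrix is invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical numerical block: rows are entities, columns are measured variables.
matrix = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]

# Without transposing: rows (entities) are compared with each other.
row_r = pearson(matrix[0], matrix[1])

# With transposing: columns become rows, so columns are compared instead.
transposed = [list(column) for column in zip(*matrix)]
col_r = pearson(transposed[0], transposed[1])

print(round(row_r, 3), round(col_r, 3))
```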

Graph Based

Pairwise formats (.txt)

Pairwise is the simplest graph format Graphia supports.

Each line represents an edge, defined by a pair of node names. There is no support for additional attributes.

The following pairwise example will create a graph with two edges: NodeA -> NodeB -> NodeC. Node names are inferred from the edge definitions.

NodeA NodeB
NodeB NodeC

Node names can be enclosed in quotes to allow for spaces. An edge weight can optionally be given as a third value on each line.

"Node A" "Node B" 2.3
"Node B" "Node C" 1.5
"Node A" "Node C" 0.5

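To make the structure of the format concrete, here is a minimal Python sketch that reads pairwise text into a list of edges. It is not Graphia's own parser; it assumes whitespace-separated fields and double-quoted names, as in the examples above.

```python
import shlex

def parse_pairwise(text):
    """Parse pairwise lines into (source, target, weight) tuples; weight is None if absent."""
    edges = []
    for line in text.splitlines():
        tokens = shlex.split(line)  # honours quoted node names
        if not tokens:
            continue  # skip blank lines
        if len(tokens) not in (2, 3):
            raise ValueError(f"expected 2 or 3 fields, got {len(tokens)}: {line!r}")
        source, target = tokens[0], tokens[1]
        weight = float(tokens[2]) if len(tokens) == 3 else None
        edges.append((source, target, weight))
    return edges

example = '''"Node A" "Node B" 2.3
"Node B" "Node C" 1.5
"Node A" "Node C" 0.5
'''
edges = parse_pairwise(example)
# Node names are inferred from the edges, as described above.
nodes = sorted({name for s, t, _ in edges for name in (s, t)})
print(edges)
print(nodes)
```
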
Biolayout (.layout)

BioPAX OWL ontology (.owl)

Biological Pathway Exchange (BioPAX) is a standard format for sharing biological pathway structures, based on the OWL format. Graphia Enterprise supports BioPAX Level 3 OWL files.

A large number of biological pathways are documented online, for example in Reactome.

Graphia Enterprise will create a node for each entity within the pathway and an edge for each relationship between them.

Examples of nodes: DNA, RNA, Protein, Gene, BiochemicalReaction etc.

Examples of edges: pathwayComponent, memberPhysicalEntity, controller, controlled, product etc.

JSON Graph (.json)

JSON Graph is a specification for defining graphs in the widely used JSON format.

Graphia Enterprise can load graphs in the JSON Graph format.

An example JSON Graph connecting nodes A to B looks like the following:

{
    "graph": {
        "nodes": [
            {
                "id": "A"
            },
            {
                "id": "B"
            }
        ],
        "edges": [
            {
                "source": "A",
                "target": "B"
            }
        ]
    }
}

Node and edge attributes can be represented inside their definitions through the use of a metadata object:

{
    "graph": {
        "nodes": [
            {
                "id": "A",
                "metadata": {
                    "A Node Attribute": "Some Value",
                    "Another Node Attribute": "Some other Value"
                }
            },
            {
                "id": "B"
            }
        ],
        "edges": [
            {
                "source": "A",
                "target": "B"
            }
        ]
    }
}

Ensure that all values are surrounded by quotes (i.e. are strings) in order to be JSON Graph compliant.
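One way to avoid hand-editing mistakes such as stray commas or unquoted values is to generate the file programmatically. The following Python sketch builds the structure shown above with the standard json module; the make_json_graph helper and the "Score" attribute are hypothetical names introduced for the example.

```python
import json

def make_json_graph(nodes, edges):
    """Build a JSON Graph document; nodes maps id -> attribute dict, edges is (source, target) pairs."""
    node_list = []
    for node_id, attributes in nodes.items():
        entry = {"id": str(node_id)}
        if attributes:
            # Convert every attribute value to a string, as advised above.
            entry["metadata"] = {str(k): str(v) for k, v in attributes.items()}
        node_list.append(entry)
    edge_list = [{"source": str(s), "target": str(t)} for s, t in edges]
    return {"graph": {"nodes": node_list, "edges": edge_list}}

document = make_json_graph(
    nodes={"A": {"A Node Attribute": "Some Value", "Score": 42}, "B": {}},
    edges=[("A", "B")],
)
text = json.dumps(document, indent=4)
print(text)
```

Because json.dumps produces the text, the output is always syntactically valid JSON, and the numeric "Score" value is written as the string "42".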

GraphML (.graphml)

GraphML is an XML-based graph format. Graphia supports the loading of GraphML files.

The following is a simple example of GraphML with three nodes and two edges:

<graphml>
    <graph id="G" edgedefault="directed">
        <node id="n0"/>
        <node id="n1"/>
        <node id="n2"/>
        <edge source="n0" target="n2"/>
        <edge source="n1" target="n2"/>
    </graph>
</graphml>

To set a node name, use the desc tag. This is the preferred way in Graphia Enterprise and follows the GraphML specification.

<graphml>
    <graph id="G" edgedefault="directed">
        <node id="n0">
            <desc>Node One</desc>
        </node>
        <node id="n1">
            <desc>Node Two</desc>
        </node>
        <node id="n2">
            <desc>Node Three</desc>
        </node>
        <edge source="n0" target="n2"/>
        <edge source="n1" target="n2"/>
    </graph>
</graphml>

In order to add additional node and edge attributes to a GraphML file, they have to be declared as a key first. Once the key is declared, you can set the value for a node or edge using a data tag that references the key's id.

<graphml>
    <key id="d0" attr.name="Attribute Name" attr.type="string" for="node"/>
    <graph id="G" edgedefault="directed">
        <node id="n0">
            <data key="d0">Some attribute value</data>
        </node>
        <node id="n1">
            <data key="d0">Some other attribute value</data>
        </node>
        <node id="n2"/>
        <edge source="n0" target="n2"/>
        <edge source="n1" target="n2"/>
    </graph>
</graphml>
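The key and data mechanism can also be sketched with Python's standard xml.etree.ElementTree module. The build_graphml helper is a hypothetical name; the sketch places the key element directly under graphml, which is where the GraphML schema defines it, and omits the XML namespace just as the examples above do.

```python
import xml.etree.ElementTree as ET

def build_graphml(node_values, edges):
    """Return GraphML text; node_values maps node id -> value of one string attribute."""
    root = ET.Element("graphml")
    # Declare the attribute key before it is used by any <data> element.
    ET.SubElement(root, "key", {
        "id": "d0", "for": "node",
        "attr.name": "Attribute Name", "attr.type": "string",
    })
    graph = ET.SubElement(root, "graph", {"id": "G", "edgedefault": "directed"})
    for node_id, value in node_values.items():
        node = ET.SubElement(graph, "node", {"id": node_id})
        data = ET.SubElement(node, "data", {"key": "d0"})
        data.text = value
    for source, target in edges:
        ET.SubElement(graph, "edge", {"source": source, "target": target})
    return ET.tostring(root, encoding="unicode")

xml_text = build_graphml(
    {"n0": "Some attribute value", "n1": "Some other attribute value"},
    [("n0", "n1")],
)
print(xml_text)
```

Building the tree through the library ensures that every edge written refers to the ids passed in and that special characters in attribute values are escaped.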

Adjacency Matrix (.matrix, .csv, .tsv)

Graphia can open matrices with tab, comma or semicolon separators. The file extension should be .matrix, .csv or .tsv.

A graph can be represented using a matrix, where each row and column reflects a node and each (non-zero) value represents an edge.

The following matrix represents a graph with 5 nodes and 6 edges of edge weight 1.

0,1,0,0,0
0,0,0,1,0
0,1,0,0,1
0,0,1,0,0
1,0,0,0,0

The matrix can also optionally include node names, in this case A-E.

,A,B,C,D,E
A,0,1,0,0,0
B,0,0,0,1,0
C,0,1,0,0,1
D,0,0,1,0,0
E,1,0,0,0,0

Be sure to escape long names with quotes, for example:

"Node One","Node Two","Node Three","Node Four","Node Five"
0,1,0,0,0
0,0,0,1,0
0,1,0,0,1
0,0,1,0,0
1,0,0,0,0

The second set of node names (the row labels) is optional, which saves repetition.

These examples can be loaded into Graphia Enterprise if saved with the .matrix extension.
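For larger graphs it is usually easier to generate the matrix from an edge list than to type it by hand. The Python sketch below reproduces the named matrix above; it assumes that rows are edge sources and columns are edge targets, which this page does not state explicitly, and adjacency_matrix_text is a hypothetical helper name.

```python
import csv
import io

def adjacency_matrix_text(names, weighted_edges):
    """Return a named adjacency matrix as comma-separated text.

    Rows are sources and columns are targets (an assumption of this sketch).
    """
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for source, target, weight in weighted_edges:
        matrix[index[source]][index[target]] = weight
    buffer = io.StringIO()
    # csv quotes a field only if it contains the delimiter or a quote character.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([""] + list(names))
    for name, row in zip(names, matrix):
        writer.writerow([name] + row)
    return buffer.getvalue()

names = ["A", "B", "C", "D", "E"]
edges = [("A", "B", 1), ("B", "D", 1), ("C", "B", 1),
         ("C", "E", 1), ("D", "C", 1), ("E", "A", 1)]
matrix_text = adjacency_matrix_text(names, edges)
print(matrix_text)
```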

Graph Modelling Language (.gml)

GML is a hierarchical text format similar to a simplified JSON. Graphia can load GML files.

Here is a simple example of a GML file, where two nodes are connected via an edge.

graph
[
    node
    [
        id 0
        label "Node One"
    ]
    node
    [
        id 1
        label "Node Two"
    ]
    edge
    [
        source 0
        target 1
    ]
]

Additional attributes can be added as key-value pairs. The key name must be alphanumeric and contain no spaces or punctuation, and it must begin with a non-numeric character. String values should be enclosed in double quotes. Strings that contain special characters should be HTML encoded.

graph
[
    node
    [
        id 0
        label "Invalid"
        // 1_invalid_key is invalid due to punctuation and numeric first character!
        1_invalid_key "Some Value" 
    ]
    node
    [
        id 1
        label "Node Two"
        validKey "Some Value" // This key is correct
    ]
]
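The key rules above can be checked before a file is written. The following Python sketch validates attribute keys with a regular expression and HTML-encodes string values; gml_node and KEY_PATTERN are hypothetical names, and the pattern is one reading of the rules (ASCII letters and digits only, first character a letter).

```python
import html
import re

# Keys: alphanumeric only, first character not a digit (per the rules above).
KEY_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def gml_node(node_id, label, attributes):
    """Return the GML text for one node, validating attribute keys."""
    lines = ["    node", "    [", f"        id {node_id}",
             f'        label "{html.escape(label)}"']
    for key, value in attributes.items():
        if not KEY_PATTERN.fullmatch(key):
            raise ValueError(f"invalid GML key: {key!r}")
        if isinstance(value, str):
            # HTML-encode special characters and wrap strings in double quotes.
            lines.append(f'        {key} "{html.escape(value)}"')
        else:
            lines.append(f"        {key} {value}")
    lines.append("    ]")
    return "\n".join(lines)

print(gml_node(1, "Node Two", {"validKey": "Some Value", "score": 3}))
```

Calling gml_node with the key 1_invalid_key from the example above raises a ValueError instead of producing a file that may fail to load.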

More information on GML can be found here

MATLAB Data file (.mat)

Graphia Enterprise supports loading of 2D adjacency matrix/array variables exported from MATLAB.