Importing data

by Kozo Nishida, Alexander Pico, Barry Demchak

py4cytoscape 0.0.11

This notebook shows how to import a pandas.DataFrame of node attributes into Cytoscape as Node Table columns. The same approach works for edge and network attributes.

Prerequisites

In addition to this package (py4cytoscape), you will need:

  • Cytoscape 3.8 or greater, which can be downloaded from https://cytoscape.org/download.html. Simply follow the installation instructions on screen.

  • Complete the installation wizard.

  • Launch Cytoscape.

  • If your Cytoscape version is 3.8.2 or earlier, install the FileTransfer App.

NOTE: To run this notebook, you must manually start Cytoscape first – don’t proceed until you have started Cytoscape.

Setup required only in a remote notebook environment

If you’re using a remote Jupyter Notebook environment such as Google Colab, run the cell below. (If you’re running a local Jupyter Notebook server on the same machine as Cytoscape, you can skip this step.)

[ ]:
_PY4CYTOSCAPE = 'git+https://github.com/cytoscape/py4cytoscape@0.0.11'
import requests
exec(requests.get("https://raw.githubusercontent.com/cytoscape/jupyter-bridge/master/client/p4c_init.py").text)
IPython.display.Javascript(_PY4CYTOSCAPE_BROWSER_CLIENT_JS) # Start browser client

Note that to use the current py4cytoscape release (instead of v0.0.11), remove the _PY4CYTOSCAPE= line in the snippet above.

Sanity test to verify Cytoscape connection

By now, the connection to Cytoscape should be up and available. To verify this, try a simple operation that doesn’t alter the state of Cytoscape.

[1]:
import py4cytoscape as p4c
p4c.cytoscape_ping()
p4c.cytoscape_version_info()
You are connected to Cytoscape!
[1]:
{'apiVersion': 'v1',
 'cytoscapeVersion': '3.8.2',
 'automationAPIVersion': '1.2.0',
 'py4cytoscapeVersion': '0.0.10'}
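If Cytoscape is still starting up, the first ping can fail. The sketch below retries a ping a few times before giving up. Note that wait_for_cytoscape is a hypothetical helper, not part of py4cytoscape, and it assumes the ping callable (for example p4c.cytoscape_ping) raises an exception while Cytoscape is unreachable.

```python
import time


def wait_for_cytoscape(ping, attempts=5, delay=2.0, sleep=time.sleep):
    """Call ping() until it succeeds or the attempts run out.

    ping is any zero-argument callable that raises on failure.
    Returns the number of attempts that were needed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            ping()
            return attempt
        except Exception as err:  # broad on purpose: the exact exception type is not assumed
            last_error = err
            if attempt < attempts:
                sleep(delay)
    raise RuntimeError(
        f"Cytoscape did not respond after {attempts} attempts"
    ) from last_error


# Intended usage in this notebook (requires a running Cytoscape):
# wait_for_cytoscape(p4c.cytoscape_ping)
```

Passing the ping as a callable keeps the helper independent of py4cytoscape, so it can be tried out with a stand-in function first.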

Always Start with a Network

When importing data, you are actually performing a merge function of sorts, appending columns to nodes (or edges) that are present in the referenced network. Data that do not match elements in the network are effectively discarded upon import.
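To make the merge behavior concrete, here is a plain-Python sketch of the idea. It is only an illustration of the semantics described above, not how Cytoscape or py4cytoscape implement the import. The function merge_into_nodes and the identifier NOT_IN_NETWORK are made up for this example, and the rows are a toy subset.

```python
def merge_into_nodes(node_names, rows, key="name"):
    """Attach each row's values to the node whose name equals row[key].

    Returns (merged, discarded): merged maps every node name to its new
    attribute values, and discarded lists the keys of rows that matched
    no node.
    """
    known = set(node_names)
    by_key = {row[key]: row for row in rows}
    merged = {}
    for node in node_names:
        row = by_key.get(node)
        # Nodes without a matching data row simply get no new values.
        merged[node] = {k: v for k, v in row.items() if k != key} if row else {}
    discarded = [row[key] for row in rows if row[key] not in known]
    return merged, discarded


# Toy example: three nodes in the network, three data rows.
node_names = ["YDL194W", "YDR277C", "YBR190W"]
rows = [
    {"name": "YDL194W", "gal1RGexp": 0.139},
    {"name": "YDR277C", "gal1RGexp": 0.243},
    {"name": "NOT_IN_NETWORK", "gal1RGexp": 1.0},  # hypothetical key with no matching node
]
merged, discarded = merge_into_nodes(node_names, rows)
print(merged["YDL194W"])  # {'gal1RGexp': 0.139}
print(merged["YBR190W"])  # {} (no row for this node in the toy subset)
print(discarded)          # ['NOT_IN_NETWORK']
```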

So, in order to demonstrate data import, we first need a network. The import_network_from_file command used below reads network files in any of the supported formats (e.g., SIF, GML, XGMML).

To import the SIF file into Cytoscape, the file must be on the local machine where Cytoscape is installed, not on Colab. Use the FileTransfer App to send the SIF file from Colab to your local file system.

This step is necessary even if you are using a local Jupyter Notebook instead of Colab. (Doing so avoids reproducibility problems caused by machine-specific file paths.)

[2]:
p4c.sandbox_url_to("https://raw.githubusercontent.com/cytoscape/cytoscape-automation/master/for-scripters/Python/data/galFiltered.sif", "galFiltered.sif")
[2]:
{'filePath': 'C:\\Users\\hoge\\CytoscapeConfiguration\\filetransfer\\default_sandbox\\galFiltered.sif',
 'fileByteCount': 6861}

If you are using py4cytoscape in a Jupyter Notebook, import_network_from_file always looks for the file under the sandbox file path.

[3]:
p4c.import_network_from_file("galFiltered.sif")
[3]:
{'networks': [51], 'views': [754]}

You should now see a network with just over 300 nodes. If you look at the Node Table, you’ll see that there are no attributes other than node names. Let’s fix that…

[4]:
p4c.notebook_export_show_image()
[4]:
../_images/tutorials_Importing-data_9_0.png

Import Data

You can import data into Cytoscape from any pandas.DataFrame in Python as long as it contains row names (or an arbitrary column) that match a Node Table column in Cytoscape. In this example, we are starting with a network with yeast identifiers in the “name” column. We also have a CSV file with gene expression data values keyed by yeast identifiers here:

[5]:
import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/cytoscape/RCy3/master/inst/extdata/galExpData.csv")
[6]:
data
[6]:
        name   COMMON  gal1RGexp     gal1RGsig  gal4RGexp     gal4RGsig  gal80Rexp     gal80Rsig
  0  YDL194W     SNF3      0.139  1.804300e-02      0.333  3.396100e-02      0.449  1.134800e-02
  1  YDR277C     MTH1      0.243  2.190000e-05      0.192  2.804400e-02      0.448  5.730000e-04
  2  YBR043C  YBR043C      0.454  5.370000e-08      0.023  9.417800e-01      0.000  9.999990e-01
  3  YPR145W     ASN1     -0.195  3.170000e-05     -0.614  1.150000e-07     -0.232  1.187300e-03
  4  YER054C     GIP2      0.057  1.695800e-01      0.206  6.200000e-04      0.247  4.360300e-03
...      ...      ...        ...           ...        ...           ...        ...           ...
324  YOR204W     DED1     -0.033  3.994400e-01     -0.056  3.126800e-01     -0.910  8.350000e-16
325  YGL097W     SRM1      0.160  2.191300e-03     -0.230  2.246100e-03      0.008  9.382600e-01
326  YGR218W     CRM1     -0.018  6.138100e-01     -0.001  9.794000e-01     -0.018  8.096900e-01
327  YGL122C     NAB2      0.174  8.730000e-04      0.020  6.170700e-01      0.187  5.996600e-03
328  YKR026C     GCN3     -0.154  9.120000e-04     -0.501  3.570000e-06      0.292  1.122900e-02

329 rows × 8 columns

Note: there may be times when your network and data identifiers are of different types. This calls for identifier mapping. py4cytoscape provides a function to perform ID mapping in Cytoscape:

[7]:
?p4c.map_table_column
Signature:
p4c.map_table_column(
    column,
    species,
    map_from,
    map_to,
    force_single=True,
    table='node',
    namespace='default',
    network=None,
    base_url='http://127.0.0.1:1234/v1',
)
Docstring:
Map Table Column.

Perform identifier mapping using an existing column of supported identifiers to populate a new column with
identifiers mapped to the originals.

Supported species: Human, Mouse, Rat, Frog, Zebrafish, Fruit fly, Mosquito, Worm, Arabidopsis thaliana, Yeast,
E. coli, Tuberculosis. Supported identifier types (depending on species): Ensembl, Entrez Gene, Uniprot-TrEMBL,
miRBase, UniGene,  HGNC (symbols), MGI, RGD, SGD, ZFIN, FlyBase, WormBase, TAIR.

Args:
    column (str): Name of column containing identifiers of type specified by ``map.from``
    species (str): Common name for species associated with identifiers, e.g., Human. See details.
    map_from (str): Type of identifier found in specified ``column``. See details.
    map.to (str): Type of identifier to populate in new column. See details.
    force.single (bool): Whether to return only first result in cases of one-to-many mappings; otherwise
        the new column will hold lists of identifiers. Default is TRUE.
    table (str): name of Cytoscape table to load data into, e.g., node, edge or network; default is "node"
    namespace (str): Namespace of table. Default is "default".
    network (SUID or str or None): Name or SUID of a network. Default is the
        "current" network active in Cytoscape.
    base_url (str): Ignore unless you need to specify a custom domain,
        port or version to connect to the CyREST API. Default is http://127.0.0.1:1234
        and the latest version of the CyREST API supported by this version of py4cytoscape.

Returns:
    dataframe: contains map_from and map_to columns.

Warnings:
    If map_to is not unique, it will be suffixed with an incrementing number in parentheses, e.g.,
    if mapIdentifiers is repeated on the same network. However, the original map_to column will be returned regardless.

Raises:
    HTTPError: if table or namespace or table doesn't exist in network
    CyError: if network name or SUID doesn't exist, or if mapping parameter is invalid
    requests.exceptions.RequestException: if can't connect to Cytoscape or Cytoscape returns an error

Examples:
    >>> map_table_column('name','Yeast','Ensembl','SGD')
              name        SGD
    17920  YER145C S000000947
    17921  YMR058W S000004662
    17922  YJL190C S000003726
    ...
File:      c:\users\hoge\miniforge3\lib\site-packages\py4cytoscape\tables.py
Type:      function

Check out the Identifier mapping notebook for detailed examples.
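A quick way to judge whether mapping is needed is to check what your identifiers look like. The sketch below is a rough heuristic for the yeast data used here: it measures the fraction of identifiers that follow the systematic ORF naming pattern (such as YDL194W). The helper fraction_systematic is hypothetical and not part of py4cytoscape, and the pattern deliberately ignores less common cases such as mitochondrial ORF names.

```python
import re

# Systematic yeast ORF names: "Y", a chromosome letter (A-P), an arm (L or R),
# three digits, and a strand (W or C), with an optional suffix such as "-A".
ORF_PATTERN = re.compile(r"^Y[A-P][LR]\d{3}[WC](-[A-Z])?$")


def fraction_systematic(identifiers):
    """Return the fraction of identifiers that look like systematic ORF names."""
    ids = list(identifiers)
    if not ids:
        return 0.0
    return sum(bool(ORF_PATTERN.match(i)) for i in ids) / len(ids)


print(fraction_systematic(["YDL194W", "YDR277C", "YBR043C"]))  # 1.0
print(fraction_systematic(["SNF3", "MTH1", "ASN1"]))           # 0.0
```

With a pandas.DataFrame you could pass a column directly, for example fraction_systematic(data["name"]). A value well below 1.0 suggests the column holds a different identifier type than the network's “name” column.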

Now we have a pandas.DataFrame that includes our identifiers in a column called “name”, plus a bunch of data columns. Knowing our key columns, we can now perform the import. First, here is the Node Table before loading:

[8]:
p4c.get_table_columns()
[8]:
     SUID  shared name     name  selected
512   512      YBR190W  YBR190W     False
514   514      YOL059W  YOL059W     False
516   516      YER102W  YER102W     False
518   518      YOR362C  YOR362C     False
520   520      YMR044W  YMR044W     False
...   ...          ...      ...       ...
499   499      YER090W  YER090W     False
501   501      YDR354W  YDR354W     False
503   503      YNL113W  YNL113W     False
504   504      YPR110C  YPR110C     False
506   506      YER103W  YER103W     False

331 rows × 4 columns

[9]:
p4c.load_table_data(data, data_key_column="name")
[9]:
'Success: Data loaded in defaultnode table'
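According to its help text, load_table_data reports the outcome as a string that starts with either 'Success:' or 'Failed to load data:' rather than raising on a key mismatch. In a longer script it is easy to overlook that string, so a small guard can help. The function check_load_result below is a hypothetical helper, not part of py4cytoscape.

```python
def check_load_result(message):
    """Return message if it reports success; raise ValueError otherwise."""
    if not isinstance(message, str) or not message.startswith("Success:"):
        raise ValueError(f"load_table_data did not succeed: {message!r}")
    return message


# The two strings below are the documented example results.
print(check_load_result("Success: Data loaded in defaultnode table"))
try:
    check_load_result(
        "Failed to load data: Provided key columns do not contain any matches"
    )
except ValueError as err:
    print(err)
```

In this notebook the call would look like check_load_result(p4c.load_table_data(data, data_key_column="name")).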

If you look back at the Node Table, you’ll now see that the matching rows of our pandas.DataFrame have been imported and that its data columns appear as new Node Table columns.

[10]:
p4c.get_table_columns()
[10]:
     SUID  shared name     name  selected   COMMON  gal1RGexp  gal1RGsig  gal4RGexp  gal4RGsig  gal80Rexp  gal80Rsig
512   512      YBR190W  YBR190W     False  YBR190W     -0.209   0.000045       -0.3   0.000388       0.07    0.66515
514   514      YOL059W  YOL059W     False     GPD2     -0.499        0.0      -0.29   0.000024     -0.591        0.0
516   516      YER102W  YER102W     False    RPS8B     -0.249   0.000033     -0.364   0.000004     -0.135   0.017595
518   518      YOR362C  YOR362C     False    PRE10      0.036    0.31634     -0.043     0.2984      0.225   0.001336
520   520      YMR044W  YMR044W     False  YMR044W      0.255   0.000044     -0.093    0.15055      0.357   0.000882
...   ...          ...      ...       ...      ...        ...        ...        ...        ...        ...        ...
499   499      YER090W  YER090W     False     TRP2     -0.067   0.079619      -0.38    0.00002      0.231   0.001577
501   501      YDR354W  YDR354W     False     TRP4     -0.122   0.009028     -0.202    0.00282     -0.253   0.001209
503   503      YNL113W  YNL113W     False    RPC19      0.304    0.00001     -0.979        0.0      0.789   0.002894
504   504      YPR110C  YPR110C     False    RPC40      -0.12   0.001961     -0.339   0.000013     -0.026    0.70564
506   506      YER103W  YER103W     False     SSA4     -0.405   0.000124      0.176   0.013281     -0.826        0.0

331 rows × 11 columns

Note: we relied on the default values for table (“node”) and table_key_column (“name”), but these can be specified as well. See help docs for parameter details.

[11]:
?p4c.load_table_data
Signature:
p4c.load_table_data(
    data,
    data_key_column='row.names',
    table='node',
    table_key_column='name',
    namespace='default',
    network=None,
    base_url='http://127.0.0.1:1234/v1',
)
Docstring:
Loads data into Cytoscape tables keyed by row.

This function loads data into Cytoscape node/edge/network
tables provided a common key, e.g., name. Data.frame column names will be
used to set Cytoscape table column names.
Numeric values will be stored as Doubles in Cytoscape tables.
Integer values will be stored as Integers. Character or mixed values will be
stored as Strings. Logical values will be stored as Boolean. Lists are
stored as Lists by CyREST v3.9+. Existing columns with the same names will
keep original type but values will be overwritten.

Args:
    data (dataframe): each row is a node and columns contain node attributes
    data_key_column (str): name of data.frame column to use as key; ' default is "row.names"
    table (str): name of Cytoscape table to load data into, e.g., node, edge or network; default is "node"
    namespace (str): Namespace of table. Default is "default".
    network (SUID or str or None): Name or SUID of a network. Default is the
        "current" network active in Cytoscape.
    base_url (str): Ignore unless you need to specify a custom domain,
        port or version to connect to the CyREST API. Default is http://127.0.0.1:1234
        and the latest version of the CyREST API supported by this version of py4cytoscape.

Returns:
    str: 'Success: Data loaded in <table name> table' or 'Failed to load data: <reason>'

Raises:
    HTTPError: if table or namespace or table doesn't exist in network
    CyError: if network name or SUID doesn't exist
    requests.exceptions.RequestException: if can't connect to Cytoscape or Cytoscape returns an error

Examples:
    >>> data = df.DataFrame(data={'id':['New1','New2','New3'], 'newcol':[1,2,3]})
    >>> load_table_data(data, data_key_column='id', table='node', table_key_column='name')
    'Failed to load data: Provided key columns do not contain any matches'
    >>> data = df.DataFrame(data={'id':['YDL194W','YDR277C','YBR043C'], 'newcol':[1,2,3]})
    >>> load_table_data(data, data_key_column='id', table='node', table_key_column='name', network='galfiltered.sif')
    'Success: Data loaded in defaultnode table'
File:      c:\users\hoge\miniforge3\lib\site-packages\py4cytoscape\tables.py
Type:      function
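The type rules in the docstring above can be approximated in plain Python to predict how a column will be stored. This is only a reading of the documented rules, not the library's actual implementation, and predicted_cytoscape_type is a hypothetical helper. Treating a mix of integers and floats as Double is an extra assumption, based on pandas normally storing such a column as floats.

```python
def _kind(value):
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Double"
    if isinstance(value, list):
        return "List"
    return "String"


def predicted_cytoscape_type(values):
    """Guess the Cytoscape column type for a sequence of Python values."""
    kinds = {_kind(v) for v in values}
    if kinds == {"Integer", "Double"}:
        return "Double"  # assumption: mixed int/float columns are stored as floats
    if len(kinds) == 1:
        return kinds.pop()
    return "String"  # mixed (or empty) columns fall back to String


print(predicted_cytoscape_type([0.139, 0.243]))    # Double
print(predicted_cytoscape_type([1, 2, 3]))         # Integer
print(predicted_cytoscape_type(["SNF3", "MTH1"]))  # String
print(predicted_cytoscape_type([True, False]))     # Boolean
print(predicted_cytoscape_type([1, "a"]))          # String
```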
