Importing data

by Kozo Nishida, Alexander Pico, Barry Demchak

py4cytoscape 0.0.11

This notebook shows you how to import a pandas.DataFrame of node attributes into Cytoscape as Node Table columns. The same approach works for edge and network attributes.

Prerequisites

In addition to this package (py4cytoscape), you will need:

Cytoscape 3.8 or greater, which can be downloaded from https://cytoscape.org/download.html. Follow the on-screen installation instructions.
Complete the installation wizard
Launch Cytoscape
If your Cytoscape is 3.8.2 or earlier, install the FileTransfer App (see its installation instructions).

NOTE: To run this notebook, you must manually start Cytoscape first – don’t proceed until you have started Cytoscape.

Setup required only in a remote notebook environment

If you’re using a remote Jupyter Notebook environment such as Google Colab, run the cell below. (If you’re running a local Jupyter Notebook server on the same machine as Cytoscape, you can skip this step.)

[ ]:

_PY4CYTOSCAPE = 'git+https://github.com/cytoscape/py4cytoscape@0.0.11'
import requests
exec(requests.get("https://raw.githubusercontent.com/cytoscape/jupyter-bridge/master/client/p4c_init.py").text)
IPython.display.Javascript(_PY4CYTOSCAPE_BROWSER_CLIENT_JS) # Start browser client

Note that to use the current py4cytoscape release (instead of v0.0.11), remove the _PY4CYTOSCAPE= line in the snippet above.

Sanity test to verify Cytoscape connection

By now, the connection to Cytoscape should be up and available. To verify this, try a simple operation that doesn’t alter the state of Cytoscape.

[1]:

import py4cytoscape as p4c
p4c.cytoscape_ping()
p4c.cytoscape_version_info()

You are connected to Cytoscape!

[1]:

{'apiVersion': 'v1',
 'cytoscapeVersion': '3.8.2',
 'automationAPIVersion': '1.2.0',
 'py4cytoscapeVersion': '0.0.10'}

Always Start with a Network

When importing data, you are actually performing a merge function of sorts, appending columns to nodes (or edges) that are present in the referenced network. Data that do not match elements in the network are effectively discarded upon import.
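
The merge behaviour described above can be pictured with plain pandas, without Cytoscape. This is only an analogy under stated assumptions: toy_nodes and toy_data are hypothetical stand-ins for the Node Table and the data to import, and a left join is used to mimic "keep every node, drop data rows that match no node". It is not how Cytoscape implements the import.

```python
import pandas as pd

# Hypothetical stand-ins: three network nodes, and a data table that lacks a
# row for "YER054C" and carries one extra row ("NOT_IN_NETWORK").
toy_nodes = pd.DataFrame({"name": ["YDL194W", "YDR277C", "YER054C"]})
toy_data = pd.DataFrame({
    "name": ["YDL194W", "YDR277C", "NOT_IN_NETWORK"],
    "gal1RGexp": [0.139, 0.243, 9.999],  # last value is made up for illustration
})

# Left join on the key: every node is kept, data rows without a node are dropped,
# and nodes without data get a missing value in the new column.
merged = toy_nodes.merge(toy_data, on="name", how="left")
print(merged)
```

The row keyed by "NOT_IN_NETWORK" does not appear in the result, and the node "YER054C" is kept with a missing value in the new column.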

So, in order to demonstrate data import, we first need to have a network. The import command used below reads network files in any of the supported formats (e.g., SIF, GML, XGMML).

To import the “SIF” file into Cytoscape, the file must be on the local machine where Cytoscape is installed, not on Colab. So use the FileTransfer App to send the SIF file from Colab to your local file system.

This step is necessary even if you are using a local Jupyter Notebook instead of Colab. (It avoids reproducibility problems caused by machine-specific file paths.)

[2]:

p4c.sandbox_url_to("https://raw.githubusercontent.com/cytoscape/cytoscape-automation/master/for-scripters/Python/data/galFiltered.sif", "galFiltered.sif")

[2]:

{'filePath': 'C:\\Users\\hoge\\CytoscapeConfiguration\\filetransfer\\default_sandbox\\galFiltered.sif',
 'fileByteCount': 6861}

If you are using py4cytoscape in Jupyter Notebook, import_network_from_file will always look for the file under the sandbox file path, so the file name alone is enough.

[3]:

p4c.import_network_from_file("galFiltered.sif")

[3]:

{'networks': [51], 'views': [754]}

You should now see a network with just over 300 nodes. If you look at the Node Table, you’ll see that there are no attributes other than node names. Let’s fix that…

[4]:

p4c.notebook_export_show_image()

[4]:

../_images/tutorials_Importing-data_9_0.png

Import Data

You can import data into Cytoscape from any pandas.DataFrame in Python as long as it contains row names (or an arbitrary column) that match a Node Table column in Cytoscape. In this example, we are starting with a network with yeast identifiers in the “name” column. We also have a CSV file with gene expression data values keyed by yeast identifiers here:

[5]:

import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/cytoscape/RCy3/master/inst/extdata/galExpData.csv")

[6]:

data

[6]:

	name	COMMON	gal1RGexp	gal1RGsig	gal4RGexp	gal4RGsig	gal80Rexp	gal80Rsig
0	YDL194W	SNF3	0.139	1.804300e-02	0.333	3.396100e-02	0.449	1.134800e-02
1	YDR277C	MTH1	0.243	2.190000e-05	0.192	2.804400e-02	0.448	5.730000e-04
2	YBR043C	YBR043C	0.454	5.370000e-08	0.023	9.417800e-01	0.000	9.999990e-01
3	YPR145W	ASN1	-0.195	3.170000e-05	-0.614	1.150000e-07	-0.232	1.187300e-03
4	YER054C	GIP2	0.057	1.695800e-01	0.206	6.200000e-04	0.247	4.360300e-03
...	...	...	...	...	...	...	...	...
324	YOR204W	DED1	-0.033	3.994400e-01	-0.056	3.126800e-01	-0.910	8.350000e-16
325	YGL097W	SRM1	0.160	2.191300e-03	-0.230	2.246100e-03	0.008	9.382600e-01
326	YGR218W	CRM1	-0.018	6.138100e-01	-0.001	9.794000e-01	-0.018	8.096900e-01
327	YGL122C	NAB2	0.174	8.730000e-04	0.020	6.170700e-01	0.187	5.996600e-03
328	YKR026C	GCN3	-0.154	9.120000e-04	-0.501	3.570000e-06	0.292	1.122900e-02

329 rows × 8 columns
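
Because unmatched rows are dropped on import, it can help to compare the data keys with the node names beforehand. The following is a minimal sketch, not part of py4cytoscape: the helper key_overlap and the three-row sample are hypothetical, and the node names are passed in as a plain list so that the code runs without Cytoscape.

```python
import pandas as pd

def key_overlap(data, data_key_column, node_names):
    """Summarize how the data keys line up with the network's node names."""
    data_keys = set(data[data_key_column])
    node_keys = set(node_names)
    return {
        "matched": sorted(data_keys & node_keys),
        "data_only": sorted(data_keys - node_keys),   # rows that would be discarded
        "nodes_only": sorted(node_keys - data_keys),  # nodes that would get no values
    }

# Hypothetical sample; in the notebook, pass the DataFrame read above and the
# "name" column of the Node Table instead.
sample = pd.DataFrame({
    "name": ["YDL194W", "YDR277C", "YBR043C"],
    "gal1RGexp": [0.139, 0.243, 0.454],
})
report = key_overlap(sample, "name", ["YDL194W", "YDR277C", "YPR145W"])
print(report)
```

With Cytoscape running, the node names could be taken from p4c.get_table_columns()["name"]. A large "data_only" list is a hint that the identifiers are of a different type than the ones used in the network.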

Note: there may be times when your network and data identifiers are of different types. This calls for identifier mapping. py4cytoscape provides a function to perform ID mapping in Cytoscape:

[7]:

?p4c.map_table_column

Signature:
p4c.map_table_column(
    column,
    species,
    map_from,
    map_to,
    force_single=True,
    table='node',
    namespace='default',
    network=None,
    base_url='http://127.0.0.1:1234/v1',
)
Docstring:
Map Table Column.

Perform identifier mapping using an existing column of supported identifiers to populate a new column with
identifiers mapped to the originals.

Supported species: Human, Mouse, Rat, Frog, Zebrafish, Fruit fly, Mosquito, Worm, Arabidopsis thaliana, Yeast,
E. coli, Tuberculosis. Supported identifier types (depending on species): Ensembl, Entrez Gene, Uniprot-TrEMBL,
miRBase, UniGene,  HGNC (symbols), MGI, RGD, SGD, ZFIN, FlyBase, WormBase, TAIR.

Args:
    column (str): Name of column containing identifiers of type specified by ``map.from``
    species (str): Common name for species associated with identifiers, e.g., Human. See details.
    map_from (str): Type of identifier found in specified ``column``. See details.
    map.to (str): Type of identifier to populate in new column. See details.
    force.single (bool): Whether to return only first result in cases of one-to-many mappings; otherwise
        the new column will hold lists of identifiers. Default is TRUE.
    table (str): name of Cytoscape table to load data into, e.g., node, edge or network; default is "node"
    namespace (str): Namespace of table. Default is "default".
    network (SUID or str or None): Name or SUID of a network. Default is the
        "current" network active in Cytoscape.
    base_url (str): Ignore unless you need to specify a custom domain,
        port or version to connect to the CyREST API. Default is http://127.0.0.1:1234
        and the latest version of the CyREST API supported by this version of py4cytoscape.

Returns:
    dataframe: contains map_from and map_to columns.

Warnings:
    If map_to is not unique, it will be suffixed with an incrementing number in parentheses, e.g.,
    if mapIdentifiers is repeated on the same network. However, the original map_to column will be returned regardless.

Raises:
    HTTPError: if table or namespace or table doesn't exist in network
    CyError: if network name or SUID doesn't exist, or if mapping parameter is invalid
    requests.exceptions.RequestException: if can't connect to Cytoscape or Cytoscape returns an error

Examples:
    >>> map_table_column('name','Yeast','Ensembl','SGD')
              name        SGD
    17920  YER145C S000000947
    17921  YMR058W S000004662
    17922  YJL190C S000003726
    ...
File:      c:\users\hoge\miniforge3\lib\site-packages\py4cytoscape\tables.py
Type:      function

Check out the Identifier mapping notebook for detailed examples.

Now we have a pandas.DataFrame that includes our identifiers in a column called “name”, plus a bunch of data columns. Knowing our key columns, we can now perform the import. First, here is the Node Table as it stands before the import:

[8]:

p4c.get_table_columns()

[8]:

	SUID	shared name	name	selected
512	512	YBR190W	YBR190W	False
514	514	YOL059W	YOL059W	False
516	516	YER102W	YER102W	False
518	518	YOR362C	YOR362C	False
520	520	YMR044W	YMR044W	False
...	...	...	...	...
499	499	YER090W	YER090W	False
501	501	YDR354W	YDR354W	False
503	503	YNL113W	YNL113W	False
504	504	YPR110C	YPR110C	False
506	506	YER103W	YER103W	False

331 rows × 4 columns

[9]:

p4c.load_table_data(data, data_key_column="name")

[9]:

'Success: Data loaded in defaultnode table'

If you look back at the Node Table, you’ll now see that the data columns of our pandas.DataFrame have been imported as new columns, with values filled in for the nodes whose “name” matches a row in the data.

[10]:

p4c.get_table_columns()

[10]:

	SUID	shared name	name	selected	COMMON	gal1RGexp	gal1RGsig	gal4RGexp	gal4RGsig	gal80Rexp	gal80Rsig
512	512	YBR190W	YBR190W	False	YBR190W	-0.209	0.000045	-0.3	0.000388	0.07	0.66515
514	514	YOL059W	YOL059W	False	GPD2	-0.499	0.0	-0.29	0.000024	-0.591	0.0
516	516	YER102W	YER102W	False	RPS8B	-0.249	0.000033	-0.364	0.000004	-0.135	0.017595
518	518	YOR362C	YOR362C	False	PRE10	0.036	0.31634	-0.043	0.2984	0.225	0.001336
520	520	YMR044W	YMR044W	False	YMR044W	0.255	0.000044	-0.093	0.15055	0.357	0.000882
...	...	...	...	...	...	...	...	...	...	...	...
499	499	YER090W	YER090W	False	TRP2	-0.067	0.079619	-0.38	0.00002	0.231	0.001577
501	501	YDR354W	YDR354W	False	TRP4	-0.122	0.009028	-0.202	0.00282	-0.253	0.001209
503	503	YNL113W	YNL113W	False	RPC19	0.304	0.00001	-0.979	0.0	0.789	0.002894
504	504	YPR110C	YPR110C	False	RPC40	-0.12	0.001961	-0.339	0.000013	-0.026	0.70564
506	506	YER103W	YER103W	False	SSA4	-0.405	0.000124	0.176	0.013281	-0.826	0.0

331 rows × 11 columns
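
The call above names the key column explicitly. As noted earlier, the key can also live in the DataFrame's row names. The sketch below shows how to move the identifiers into the index with pandas. It assumes that the default data_key_column value 'row.names' refers to the DataFrame index; the three-row excerpt is a hypothetical stand-in for the full data.

```python
import pandas as pd

# Hypothetical three-row excerpt of the expression data shown earlier.
excerpt = pd.DataFrame({
    "name": ["YDL194W", "YDR277C", "YBR043C"],
    "COMMON": ["SNF3", "MTH1", "YBR043C"],
    "gal1RGexp": [0.139, 0.243, 0.454],
})

# Move the identifiers from a regular column into the row index.
indexed = excerpt.set_index("name")
print(indexed)

# With Cytoscape running, the key argument could then be left at its default:
# p4c.load_table_data(indexed)  # assumption: 'row.names' means the DataFrame index
```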

Note: we relied on the default values for table (“node”) and table_key_column (“name”), but these can be specified as well. See the help docs for parameter details.

[11]:

?p4c.load_table_data

Signature:
p4c.load_table_data(
    data,
    data_key_column='row.names',
    table='node',
    table_key_column='name',
    namespace='default',
    network=None,
    base_url='http://127.0.0.1:1234/v1',
)
Docstring:
Loads data into Cytoscape tables keyed by row.

This function loads data into Cytoscape node/edge/network
tables provided a common key, e.g., name. Data.frame column names will be
used to set Cytoscape table column names.
Numeric values will be stored as Doubles in Cytoscape tables.
Integer values will be stored as Integers. Character or mixed values will be
stored as Strings. Logical values will be stored as Boolean. Lists are
stored as Lists by CyREST v3.9+. Existing columns with the same names will
keep original type but values will be overwritten.

Args:
    data (dataframe): each row is a node and columns contain node attributes
    data_key_column (str): name of data.frame column to use as key; ' default is "row.names"
    table (str): name of Cytoscape table to load data into, e.g., node, edge or network; default is "node"
    namespace (str): Namespace of table. Default is "default".
    network (SUID or str or None): Name or SUID of a network. Default is the
        "current" network active in Cytoscape.
    base_url (str): Ignore unless you need to specify a custom domain,
        port or version to connect to the CyREST API. Default is http://127.0.0.1:1234
        and the latest version of the CyREST API supported by this version of py4cytoscape.

Returns:
    str: 'Success: Data loaded in <table name> table' or 'Failed to load data: <reason>'

Raises:
    HTTPError: if table or namespace or table doesn't exist in network
    CyError: if network name or SUID doesn't exist
    requests.exceptions.RequestException: if can't connect to Cytoscape or Cytoscape returns an error

Examples:
    >>> data = df.DataFrame(data={'id':['New1','New2','New3'], 'newcol':[1,2,3]})
    >>> load_table_data(data, data_key_column='id', table='node', table_key_column='name')
    'Failed to load data: Provided key columns do not contain any matches'
    >>> data = df.DataFrame(data={'id':['YDL194W','YDR277C','YBR043C'], 'newcol':[1,2,3]})
    >>> load_table_data(data, data_key_column='id', table='node', table_key_column='name', network='galfiltered.sif')
    'Success: Data loaded in defaultnode table'
File:      c:\users\hoge\miniforge3\lib\site-packages\py4cytoscape\tables.py
Type:      function
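
The docstring above describes how value types map to Cytoscape column types. As a rough aid, the sketch below predicts that mapping from the pandas dtypes of a DataFrame before loading it. The helper expected_cytoscape_types is hypothetical and simply restates the docstring's rules. It does not query Cytoscape, and it does not detect list-valued columns.

```python
import pandas as pd
from pandas.api import types as pdt

def expected_cytoscape_types(df):
    """Guess the Cytoscape column type for each DataFrame column.

    Follows the rules quoted in the load_table_data docstring:
    logical -> Boolean, integer -> Integer, other numeric -> Double,
    everything else -> String. List columns are not handled here.
    """
    mapping = {}
    for col in df.columns:
        dtype = df[col].dtype
        if pdt.is_bool_dtype(dtype):
            mapping[col] = "Boolean"
        elif pdt.is_integer_dtype(dtype):
            mapping[col] = "Integer"
        elif pdt.is_float_dtype(dtype):
            mapping[col] = "Double"
        else:
            mapping[col] = "String"
    return mapping

# Hypothetical sample with one column of each kind.
sample = pd.DataFrame({
    "name": ["YDL194W", "YDR277C"],
    "gal1RGexp": [0.139, 0.243],
    "rank": [1, 2],
    "significant": [True, False],
})
predicted = expected_cytoscape_types(sample)
print(predicted)
```

Keep in mind that, per the docstring, an existing Cytoscape column with the same name keeps its original type, so the prediction only applies to columns that are new to the table.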
