
Biomedical Data and Text Processing using Shell Scripting


Friday, 4th September


13:30 to 16:30 (CEST)


  • Francisco M. Couto | LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal


Shell scripting (the command line) has remained almost unchanged for more than four decades and is available on most computers. It is still one of the most important tools for solving many of the data and text processing challenges that computational biologists and bioinformaticians face in their work.

This hands-on tutorial shows how simple command-line tools can be used and combined to retrieve, extract, and filter data and text from web resources using open standard file formats, such as TSV, CSV, and XML, that can be opened in any text editor or spreadsheet application.

Because this is a 3-hour virtual tutorial, it covers only some introductory topics, namely Sections 3.5 to 3.8 of the open access book Data and Text Processing for Health and Life Sciences. Participants need to read and test the examples in the earlier sections beforehand. More teaching material is freely available at:

Target audience

This tutorial is particularly relevant to health and life sciences specialists or students who want an example-based introduction to shell scripting, so that they can automate common data and text processing tasks without needing advanced computer science skills.

Maximum participants

This tutorial is open to at most 20 attendees.


No programming skills are required, but participants need:

  1. A computer with internet access, a text editor, and a terminal application.
  2. To check that all the necessary command-line tools are available:
  3. To run and understand all the examples up to Section 3.4 of the book.


Time (CEST)
13:30-14:00 Data Retrieval (cURL)

Download proteins associated with a compound from ChEBI
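
As a rough idea of what this step involves, the sketch below wraps cURL in a small helper that saves a web resource to a local file. The function name `fetch` and the URL in the usage comment are placeholders chosen for illustration; they are not taken from the tutorial material.

```shell
# fetch URL OUTPUT_FILE
# Download a resource quietly, follow redirects, and fail on HTTP errors.
# "fetch" is a hypothetical helper name used only in this sketch.
fetch() {
  curl --silent --show-error --fail --location --output "$2" "$1"
}

# Usage with a placeholder URL (not a real endpoint):
# fetch 'https://example.org/compound.tsv' compound.tsv
```

Saving the response to a file rather than printing it keeps the download separate from the later extraction steps, so the file can be inspected in a text editor before it is processed.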

14:00-14:30 Data Extraction (grep and gawk)

Select the relevant proteins and their identifiers
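
A minimal sketch of this kind of extraction, assuming a tab-separated file with one protein per line. The column layout and the identifiers are invented for illustration, and plain `awk` is used here; gawk accepts the same program.

```shell
# Made-up sample data: identifier, name, organism (tab-separated).
sample_tsv() {
  printf 'X00001\tExample kinase\tHomo sapiens\n'
  printf 'X00002\tExample receptor\tMus musculus\n'
  printf 'X00003\tExample transporter\tHomo sapiens\n'
}

# Keep the lines that mention the organism, then print the first column.
human_ids() {
  grep -F 'Homo sapiens' | awk -F '\t' '{ print $1 }'
}

sample_tsv | human_ids
```

With the sample data above, this prints the identifiers X00001 and X00003, one per line.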

14:30-15:00 Task Repetition (xargs)

Download information of multiple proteins from UniProt
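
The idea of repeating one command per identifier can be sketched with xargs as follows. The URL pattern is a placeholder and the identifiers are invented; `echo` is placed in front of the command so that the sketch only prints what would be run.

```shell
# Read one identifier per line and build one curl command for each.
# "echo" prints the command instead of executing it; the URL is a
# placeholder, not a real endpoint.
build_commands() {
  xargs -I {} echo curl -s -o '{}.xml' 'https://example.org/proteins/{}.xml'
}

printf '%s\n' X00001 X00003 | build_commands
```

Removing `echo` would make xargs execute each command, which is a convenient way to check the generated commands before any download takes place.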

15:00-16:00 XML Processing (xmllint and xpath)

Identify the UniProt entries that represent a Homo sapiens (Human) protein
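
A minimal sketch of an XPath query with xmllint, run on a small invented XML document. The element names are assumptions made for illustration; real files may use a different structure and may declare an XML namespace, which would require adapting the query.

```shell
# Invented sample entry; the element names are assumptions for this sketch.
sample_xml() {
  cat <<'EOF'
<entry>
  <accession>X00001</accession>
  <organism>
    <name type="scientific">Homo sapiens</name>
    <name type="common">Human</name>
  </organism>
</entry>
EOF
}

# Print the scientific organism name of the XML document read from stdin.
organism_name() {
  xmllint --xpath 'string(//organism/name[@type="scientific"])' -
}

# Succeed only if the entry read from stdin is a Homo sapiens entry.
is_human() {
  [ "$(organism_name)" = "Homo sapiens" ]
}

sample_xml | is_human && echo "human entry"
```

Wrapping the query in `string(...)` makes xmllint print the text value alone, which is easier to compare in a shell test than the serialized XML nodes.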

16:00-16:30 Text Retrieval (cURL, grep, gawk, xargs, xmllint, xpath)

Download the text (titles and abstracts) of the publications associated with a list of proteins