NTB - T10
Biomedical Data and Text Processing using Shell Scripting
Date: Friday, 4th September
Time: 13:30 to 16:30 (CEST)
Instructor
- Francisco M. Couto | LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal
Summary
Shell scripting (the command line) has remained almost unchanged for more than four decades and is available on most computers. It is still one of the most important tools for solving many of the data and text processing challenges that computational biologists and bioinformaticians face in their work.
This tutorial will be a hands-on session on how simple command-line tools can be used and combined to retrieve, extract, and filter data and text from web resources. It uses open standard data file formats, such as TSV, CSV, and XML, that can be opened by any text editor or spreadsheet application.
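As a rough illustration of what combining simple tools looks like, the sketch below filters a small tab-separated table and prints one column. The sample rows, identifiers, and function names are made up for this example. They are not taken from ChEBI, UniProt, or the book.

```shell
# Hypothetical sample data in TSV form: name, identifier, organism.
sample_tsv() {
  printf 'Name\tIdentifier\tOrganism\n'
  printf 'protein A\tID0001\tHomo sapiens\n'
  printf 'protein B\tID0002\tMus musculus\n'
  printf 'protein C\tID0003\tHomo sapiens\n'
}

# Keep only the human rows (grep), then print the identifier column (cut).
# cut splits on tab characters by default, which matches the TSV format.
human_ids() {
  grep 'Homo sapiens' | cut -f2
}

sample_tsv | human_ids
```

Run as written, the pipeline prints the two identifiers of the rows whose organism is Homo sapiens. Each tool does one small job, and the pipe connects them.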
Because this is a 3-hour virtual tutorial, it will cover only some introductory topics, namely Sections 3.5 to 3.8 of the open access book entitled Data and Text Processing for Health and Life Sciences. Participants need to read and test the examples of the previous sections beforehand. More teaching material is freely available at: http://labs.rd.ciencias.ulisboa.pt/book/.
Target audience
This tutorial is particularly relevant to health and life sciences specialists or students who want an example-based introduction to shell scripting, so that they can easily automate some common data and text processing tasks without needing to acquire advanced computer science skills.
Maximum participants
This tutorial is open to at most 20 attendees.
Requirements
No programming skills are required, but participants need to:
- Have a computer with internet access, a text editor, and a terminal application.
- Check that all the necessary command-line tools are available:
  - Linux: https://youtu.be/4cO9vvbxWUU
  - macOS: https://youtu.be/zw7Nd67_aFw
  - Windows alternatives:
    - MobaXterm: https://youtu.be/yI1No5_o-Kw
    - Windows Subsystem for Linux: https://youtu.be/VNlksnYDE0Y
- Execute and understand all the examples up to Section 3.4 of the book.
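The videos above are the reference for setting up each platform. As a quick complementary check, a sketch like the following reports which tools are on the PATH. The tool list is an assumption based on the commands named in the schedule, and check_tools is a hypothetical helper name.

```shell
# Report, for each tool name given as an argument, whether it is on the PATH.
# Returns a non-zero status if at least one tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Assumed tool list, taken from the commands mentioned in the schedule.
check_tools curl grep gawk xargs xmllint ||
  echo "Install the missing tools before the session."
```

Any line starting with "missing:" points to a tool that still needs to be installed, for example through the package manager of the platform.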
Schedule
Time (CEST) | Details
13:30-14:00 | Data Retrieval (cURL): Download proteins associated with a compound from ChEBI
14:00-14:30 | Data Extraction (grep and gawk): Select the relevant proteins and their identifiers
14:30-15:00 | Task Repetition (xargs): Download information of multiple proteins from UniProt
15:00-16:00 | XML Processing (xmllint and xpath): Identify the UniProt entries that represent a Homo sapiens (Human) protein
16:00-16:30 | Text Retrieval (cURL, grep, gawk, xargs, xmllint, xpath): Download the text (titles and abstracts) of the publications associated with a list of proteins
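To give a flavour of the task repetition and XML processing steps, here is a minimal offline sketch. It is not the tutorial's actual material: the XML entry is a simplified, made-up structure (real UniProt XML is richer and uses a namespace, so the XPath expression would need adapting), the URL pattern is a placeholder, and the function names are hypothetical.

```shell
# Assumption: a simplified, made-up XML entry, not real UniProt output.
sample_entry() {
  cat <<'EOF'
<entry>
  <accession>ID0001</accession>
  <organism><name type="scientific">Homo sapiens</name></organism>
</entry>
EOF
}

# Task repetition: xargs runs the command once per input line, replacing {}
# with that line. The URL is a placeholder; swapping echo for a curl call
# would turn this into one download per identifier.
build_urls() {
  xargs -I {} echo "https://example.org/entries/{}.xml"
}

# XML processing: the XPath string() function returns the text of the first
# matching node. The trailing "-" makes xmllint read the XML from stdin.
organism_of() {
  xmllint --xpath 'string(//organism/name[@type="scientific"])' -
}

printf 'ID0001\nID0002\n' | build_urls

# Only run the XML step when xmllint is installed.
if command -v xmllint >/dev/null 2>&1; then
  sample_entry | organism_of
fi
```

Keeping each step in its own small function mirrors the schedule: the output of one step (a list of identifiers) becomes the input of the next (a list of URLs or files to inspect).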