Bio-Text Mining for Construction of Biomedical Information Networks

Massive amounts of biomedical text data are generated from research literature, web publication portals, experimental reports and social media. It is critical but challenging to mine such massive, unstructured, dynamic, noisy and unintegrated data and turn it into structured knowledge. We propose to develop effective and scalable methods that automatically integrate and transform such biomedical text data into relatively structured biomedical information networks, and then to develop data mining methods that mine these text-rich biomedical networks and generate useful knowledge for KnowEnG and other BD2K center projects. We have been developing multiple innovative and scalable methods for the construction and mining of text-rich biomedical information networks, outlined as follows: (i) phrase mining, including the completely unsupervised phrase mining method ToPMine and the lightly supervised phrase mining method SegPhrase; (ii) ClusType, an integrated optimization framework that combines relation-expression clustering, distant supervision and multiple strategies; (iii) meta-path-based similarity search; and (iv) heterogeneous network mining. We have conducted studies on the construction and mining of biomedical information networks based on PubMed abstracts, with some interesting results. Preliminary studies on other kinds of massive text datasets, such as the New York Times, Yelp, Twitter and DBLP research publication datasets, have demonstrated the power and high promise of the proposed approach. We expect more dedicated work on biomedical text mining in the coming months to benefit multiple NIH BD2K centers.
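To make the idea of meta-path-based similarity search concrete, the following is a minimal sketch that assumes a PathSim-style measure over a symmetric meta-path such as gene-paper-gene. The abstract does not say which measure is used, so the choice of PathSim, the function names, and the toy mention counts are all illustrative assumptions and not results from this work.

```python
# Minimal sketch of PathSim-style similarity on a symmetric meta-path
# (assumed here to be gene -> paper -> gene). All data below is made up.

def commuting_matrix(adj):
    """Return M = A * A^T, where adj[i][j] counts links from object i
    (e.g. a gene) to object j (e.g. a paper). M[x][y] is the number of
    meta-path instances connecting x and y."""
    n = len(adj)
    return [
        [sum(a * b for a, b in zip(adj[i], adj[j])) for j in range(n)]
        for i in range(n)
    ]


def pathsim(m, x, y):
    """PathSim score: 2 * M[x][y] / (M[x][x] + M[y][y]); 0.0 if both
    objects have no meta-path instances at all."""
    denom = m[x][x] + m[y][y]
    return 2 * m[x][y] / denom if denom else 0.0


# Hypothetical toy network: 3 genes (rows) mentioned in 4 papers (columns).
genes = ["gene_a", "gene_b", "gene_c"]
mentions = [
    [1, 1, 0, 0],  # gene_a appears in papers 0 and 1
    [1, 1, 1, 0],  # gene_b appears in papers 0, 1 and 2
    [0, 0, 0, 1],  # gene_c appears only in paper 3
]

M = commuting_matrix(mentions)

# Rank the other genes by similarity to gene_a.
query = 0
ranking = sorted(
    ((genes[j], pathsim(M, query, j)) for j in range(len(genes)) if j != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

In this toy network, gene_a and gene_b share two papers and therefore score well above gene_c, which shares none; normalizing by the self-path counts keeps heavily mentioned objects from dominating the ranking.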


Meng Qu

Jingbo Shang

Jian Peng

Sheng Wang

Xiang Ren

Jialu Liu

Ahmed El-Kishky

Yu Shi

Doris Xin

Henry Lin

Saurabh Sinha

ChengXiang Zhai

Jiawei Han


Jiawei Han