Beginner's Guide To Build NLP Models Using Medical Text Transcription

Table of Contents

  1. Introduction
  2. Medical Text Transcription Classification Using NLP
  3. Model Building
  4. Challenges of NLP in Medical Text Classification
  5. Conclusion
  6. Frequently Asked Questions

Introduction

In the realm of healthcare, the vast amount of medical data has motivated the need for efficient methods to organize, interpret, and extract valuable insights.

These data, or transcripts, document patient encounters, diagnoses, procedures, and treatment plans. Efficiently processing and organizing this information can be challenging due to its sheer volume and complexity.

Here’s where Natural Language Processing (NLP) comes in as a game changer. NLP techniques allow healthcare professionals to streamline the transcription process, improve accuracy, and surface information that would otherwise stay buried in medical texts.

Medical Text Transcription Classification Using NLP

Medical text transcription classification involves categorizing textual medical records, reports, or notes into predetermined categories or labels. These categories may cover aspects such as patient demographics, medical conditions, procedures, treatments, and outcomes. Performing this task manually is time-consuming and resource-intensive. With the help of NLP, automated approaches have changed the landscape of medical transcription.

Model Building

In this tutorial, we will build our own medical transcription classifier. We use a dataset from Kaggle (mtsamples.csv) to build and train the model.

1. First, we will import all the required libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.manifold import TSNE

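import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer

# One-time downloads for the tokenizers and the lemmatizer. Resource names
# vary by NLTK version (newer releases may also need 'punkt_tab').
nltk.download('punkt')
nltk.download('wordnet')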


2. After importing the libraries, let's read our dataset and see all the columns present in our dataset.

Text_df=pd.read_csv('mtsamples.csv')
Text_df.head()

We have 6 columns:

1) Unnamed: 0: This column is the row index and holds no information specific to the medical transcripts.

2) description: This column contains a brief description of the medical transcript, summarizing the patient's condition, procedures performed, or other relevant details.

3) medical_specialty: This column indicates the area of medicine the transcript pertains to, such as cardiology, oncology, psychiatry, etc.

4) sample_name: This column contains a unique name assigned to the specific transcript sample within the dataset.

5) transcription: This column holds the core content of the dataset, containing the actual text of the medical transcription.

6) keywords: This column contains a list of important keywords or named entities extracted from the transcription text.

3. Now we will try to understand our dataset in more detail. For this, we will create a function that returns the total number of sentences and the number of unique words (the vocabulary size) in a list of texts.

def unique_sentence_word_count(text_list):
    """Return (total sentence count, vocabulary size) for a list of texts."""
    sent_count = 0
    vocabulary = set()
    for text in text_list:
        # str() guards against non-string values such as NaN.
        sentences = sent_tokenize(str(text).lower())
        sent_count += len(sentences)
        for sentence in sentences:
            # Collect every distinct token seen so far.
            vocabulary.update(word_tokenize(sentence))
    return sent_count, len(vocabulary)

4. Now let’s print the sentence count and unique word count for the transcription column, followed by the number of samples in each category. We will also drop all rows whose transcription is missing.

Text_df = Text_df.dropna(subset=['transcription'])

sent_count, word_count = unique_sentence_word_count(Text_df['transcription'].tolist())
print("Sentences in 'transcription' column:", sent_count)

print("Unique words in 'transcription' column:", word_count)



data_categories = Text_df.groupby(Text_df['medical_specialty'])
for i, (catName, dataCategory) in enumerate(data_categories, 1):
    print(f'Cat:{i} {catName} : {len(dataCategory)}')

As the output above shows, the dataset has around 40 categories. Many of them contain very few samples, so we will drop those.

5. We will drop all the categories that have 50 or fewer samples.

filtered_categories = data_categories.filter(lambda x: x.shape[0] > 50)
final_categories = filtered_categories.groupby(
    filtered_categories['medical_specialty'])
counter = 1
for category_name, category_data in final_categories:
    print(f'Cat:{counter} {category_name} : {len(category_data)}')
    counter += 1
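
The strict `>` comparison matters at the boundary. A minimal sketch on a toy DataFrame (the specialty names and the threshold of 2 are made up for illustration) suggests that a group whose size equals the threshold is dropped along with the smaller ones:

```python
import pandas as pd

# Toy data: hypothetical specialties with 3, 2 and 1 rows respectively.
toy_df = pd.DataFrame({
    "medical_specialty": ["Surgery"] * 3 + ["Urology"] * 2 + ["Dental"],
    "transcription": ["note"] * 6,
})

threshold = 2  # stands in for the 50 used in the tutorial

# groupby(...).filter keeps every row of each group for which the lambda is True.
kept = toy_df.groupby("medical_specialty").filter(lambda g: len(g) > threshold)

kept_counts = kept["medical_specialty"].value_counts().to_dict()
print(kept_counts)
```

Only the three-row group survives here; switching to `>=` would keep the two-row group as well.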

6. Now let’s visualize the categories using countplot.

plt.figure(figsize=(20,10))
sns.countplot(y='medical_specialty', data=filtered_categories)
plt.show()

7. We will be dropping columns that are not required.

filtered_data = filtered_categories[['transcription', 'medical_specialty']].dropna(
    subset=['transcription'])

8. Now we will be pre-processing our text using two functions:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
import string
import re


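def clean_text(input_text):
    # Remove punctuation.
    cleaned = input_text.translate(
        str.maketrans('', '', string.punctuation))

    # Remove digits.
    cleaned = ''.join([char for char in cleaned if not char.isdigit()])

    # Replace leftover bracket/separator symbols with spaces (a no-op after
    # the punctuation step above, kept as a safeguard).
    cleaned = re.sub(r'[/(){}\[\]\|@,;]', ' ', cleaned)

    # Convert to lowercase.
    cleaned = cleaned.lower()

    return cleaned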


def lemmatize_text(input_text):
    lemmatized_words = []
    lemmatizer = WordNetLemmatizer()
    sentences = sent_tokenize(input_text)

    # Nothing to lemmatize for empty input.
    if not sentences:
        return ''

    # Only the first and last sentences are used; avoid processing the
    # same sentence twice when the text has a single sentence.
    selected = [sentences[0]]
    if len(sentences) > 1:
        selected.append(sentences[-1])

    for sentence in selected:
        words = word_tokenize(sentence)
        lemmatized_words.extend([lemmatizer.lemmatize(word) for word in words])

    return ' '.join(lemmatized_words)


1) clean_text: This function removes punctuation and digits, replaces a handful of bracket and separator symbols with spaces, and converts all letters to lowercase. This reduces noise in the dataset and lets the model focus on the core content of the text.

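2) lemmatize_text: This function performs lemmatization, which reduces words to their base forms, also known as lemmas. For example, when tagged as verbs, "running," "runs," and "ran" all reduce to the lemma "run." Note that WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so as written the function mainly normalizes plurals, and it processes only the first and last sentence of each text.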
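
One side effect of deleting punctuation outright is that tokens joined by a slash or hyphen get fused ("and/or" becomes "andor"). The sketch below contrasts that behavior with a variant that replaces punctuation with a space first. `clean_text_spaced` is a hypothetical helper, not part of the tutorial's code, and the sample sentence is invented.

```python
import re
import string

def clean_text_delete(text):
    """Delete punctuation and digits, then lowercase (mirrors clean_text above)."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = "".join(ch for ch in text if not ch.isdigit())
    return text.lower()

def clean_text_spaced(text):
    """Hypothetical variant: turn punctuation and digits into spaces instead."""
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = re.sub(r"\d", " ", text)
    # Collapse the runs of whitespace created by the substitutions.
    return re.sub(r"\s+", " ", text).strip().lower()

sample = "Nausea and/or vomiting; follow-up in 2 weeks."
deleted = clean_text_delete(sample)
spaced = clean_text_spaced(sample)
print(deleted)
print(spaced)
```

Which variant works better depends on the data; the spaced version keeps "follow" and "up" as separate tokens, which may or may not be what the downstream vectorizer should see.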

8. Let’s see some sample data present in our dataset.

sample_transcription_1 = filtered_data.iloc[5]['transcription']
sample_transcription_2 = filtered_data.iloc[125]['transcription']
print('Sample Transcription 1:\n', sample_transcription_1, '\n')
print('Sample Transcription 2:\n', sample_transcription_2, '\n')

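9. Now we will use TF-IDF to convert our text data into a numerical representation that gives more weight to terms that are frequent in a transcript but rare across the dataset, which tends to highlight the most informative words for our task. Note that the vectorizer below is fit on the raw transcription column; to use the preprocessed text instead, apply clean_text and lemmatize_text to the column first.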

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    analyzer='word', stop_words='english', ngram_range=(1, 3),
    max_df=0.75, use_idf=True, smooth_idf=True, max_features=1000)


tfidf_matrix = tfidf_vectorizer.fit_transform(
    filtered_data['transcription'].tolist())


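# get_feature_names() was removed in recent scikit-learn releases;
# get_feature_names_out() is its replacement.
features = sorted(tfidf_vectorizer.get_feature_names_out())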
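
To make the weighting concrete, here is a hand-rolled sketch of the smoothed IDF that scikit-learn documents for `smooth_idf=True`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, applied to three invented mini-documents. It skips stop-word removal, n-grams, and the final L2 row normalization, so the numbers will not match `TfidfVectorizer` output exactly.

```python
import math
from collections import Counter

docs = [
    "chest pain and shortness of breath",
    "knee pain after surgery",
    "chest xray shows clear lungs",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, as with smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    """Raw term count times smoothed IDF (no L2 normalization)."""
    counts = Counter(tokens)
    return {term: count * smoothed_idf(term) for term, count in counts.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, "pain" and "chest" each appear in two of the three documents, so they receive a lower weight than terms such as "breath" that appear in only one.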

10. Now we will be creating labels and category list for our dataset.

labels = filtered_data['medical_specialty'].tolist()
category_list = filtered_data.medical_specialty.unique()

11. Now let's split our dataset for model training and testing.

X_train, X_test, y_train, y_test = train_test_split(
    tfidf_matrix, labels, stratify=labels, random_state=1)
print('Train_Set_Size:' + str(X_train.shape))
print('Test_Set_Size:' + str(X_test.shape))
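
Because `test_size` is not set, `train_test_split` falls back to its default of holding out 25% of the rows. A small sketch with invented labels illustrates what `stratify` is for: the 2:1 class ratio of the full set is preserved in both portions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented toy data: 8 samples of class "a" and 4 of class "b" (ratio 2:1).
toy_X = [[i] for i in range(12)]
toy_y = ["a"] * 8 + ["b"] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=1)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

With imbalanced specialties, a split without stratification could leave a small category almost entirely in one portion, which makes the per-class scores unreliable.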

12. We will use the logistic regression algorithm to build our model.

from sklearn.linear_model import LogisticRegression
# L2-regularized logistic regression with the lbfgs solver;
# tune these parameters as needed.
clf = LogisticRegression(penalty='l2', solver='lbfgs',
                         random_state=42).fit(X_train, y_train)
y_test_pred = clf.predict(X_test)
print(classification_report(y_test, y_test_pred, labels=category_list))

13. Let’s draw a confusion matrix to see how the predicted labels compare with the true labels.

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
conf_matrix = confusion_matrix(y_test, y_test_pred)
plt.figure(figsize=(8, 6))
# Colormap names are case-sensitive: "Greens", not "greens".
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Greens", cbar=False,
            xticklabels=clf.classes_, yticklabels=clf.classes_)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix')
plt.show()
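
When reading the matrix, rows are true labels and columns are predicted labels (scikit-learn's convention), so per-class recall is the diagonal entry divided by its row sum, and precision is the diagonal entry divided by its column sum. A sketch with an invented 3x3 matrix and hypothetical class names:

```python
# Invented confusion matrix: rows = true class, columns = predicted class.
classes = ["Cardiology", "Neurology", "Urology"]
matrix = [
    [8, 1, 1],   # true Cardiology
    [2, 6, 2],   # true Neurology
    [0, 1, 9],   # true Urology
]

def per_class_scores(matrix):
    """Return (precision, recall) per class from a square confusion matrix."""
    scores = []
    for i in range(len(matrix)):
        true_pos = matrix[i][i]
        row_sum = sum(matrix[i])                 # samples truly in class i
        col_sum = sum(row[i] for row in matrix)  # samples predicted as class i
        precision = true_pos / col_sum if col_sum else 0.0
        recall = true_pos / row_sum if row_sum else 0.0
        scores.append((precision, recall))
    return scores

scores = per_class_scores(matrix)
for name, (precision, recall) in zip(classes, scores):
    print(f"{name:11s} precision={precision:.2f} recall={recall:.2f}")
```

These are the same quantities that `classification_report` prints per class, so the heatmap and the report can be cross-checked against each other.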

Challenges of NLP in Medical Text Classification

1) Limited labeled data: Training effective NLP models requires large amounts of labeled data, which can be scarce and expensive to obtain in the medical domain due to privacy concerns and the need for expert annotation.

2) Noise and errors: Medical transcripts may contain errors, typos, abbreviations, and inconsistencies, requiring additional cleaning and normalization steps before processing.

3) Computational complexity: Training and deploying NLP models for large datasets can be computationally expensive, requiring significant resources and infrastructure.

4) Medical terminology: Complex medical acronyms and specialized terms can be challenging for NLP models to understand and interpret accurately.
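
One common mitigation for the abbreviation and terminology issues above is to expand known abbreviations before vectorizing. The sketch below uses a tiny hand-written mapping; the entries and the `expand_abbreviations` helper are illustrative assumptions, and a real system would need a curated, context-aware dictionary because many clinical abbreviations are ambiguous.

```python
import re

# Illustrative mapping only; real clinical abbreviations are often ambiguous.
ABBREVIATIONS = {
    "bp": "blood pressure",
    "hr": "heart rate",
    "sob": "shortness of breath",
}

# Match whole words only, case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_abbreviations(text):
    """Replace each known abbreviation with its long form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)

expanded = expand_abbreviations("Pt reports SOB; BP and HR stable.")
print(expanded)
```

The word-boundary anchors keep the pattern from rewriting longer words that merely contain an abbreviation, such as "sobbing".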

Conclusion


Medical text classification using NLP offers a powerful approach to unlocking valuable insights from the vast amount of textual data generated in healthcare.

By leveraging techniques like text cleaning, lemmatization, and TF-IDF, we can transform raw transcripts into meaningful representations suitable for machine learning models.

By overcoming the challenges and embracing innovation, NLP has the potential to transform the field of medical transcription, leading the way for more efficient, accurate, and patient-focused healthcare delivery.


Frequently Asked Questions

Q1) What is Natural Language Processing (NLP), and how is it relevant to medical text classification?

NLP is a branch of artificial intelligence concerned with the interaction between computers and human language.

In the context of healthcare, NLP allows automated processing and classification of medical texts, improving efficiency and accuracy in tasks such as clinical documentation and electronic health record management.

Q2) How do NLP models learn to classify medical texts?

NLP models are trained on large datasets of annotated medical texts, where they learn to recognize patterns, associations, and semantic relationships between words and phrases.

Through techniques such as supervised learning, the models learn to assign input text to predefined categories or labels.
