The ability to efficiently gather large amounts of clean data is a critical skill today.
Web scraping is one of the most important techniques for accessing information at scale. It is the process of automatically extracting specific information from a website. In a way, it's a game of hide-and-seek. We write a function, or "bot," that crawls the pages of a website and extracts data in a form that we can use in our dataset.
Web scraping is used in a variety of fields. For instance, we could create a bot that retrieves stock prices for a specific day, the daily average temperature, or the total number of London Underground commuters.
Web scraping allows us to add more features to our data and build richer datasets more quickly. In medicine, there is ample opportunity to expand datasets. We can, for example, gather information on genetic variants from the internet or enrich databases with the possible side effects of medications.
This notebook demonstrates web scraping using the Python Selenium module. I'll show you how it works using the following scenario:
Developing code that fetches drug information from the NHS (National Health Service) website, using prescription data from the National Health and Nutrition Examination Survey (NHANES).
The Code: Prescribed Medication
We start by importing our libraries:

# import libraries
import pandas as pd
import numpy as np
# !pip install selenium
from selenium import webdriver
Next, we load the dataset, which is available at https://www.kaggle.com/dkryan/webscraping/. It provides the list of prescribed medications for every individual in the National Health and Nutrition Examination Survey (NHANES).
This is a publicly available, cross-sectional health dataset that studies the health and nutrition of a representative sample of the American population.
# import drug chart
df = pd.read_csv('drug_chart.csv')
Note that the prescriptions are stored as a single string per patient, whereas a list of prescriptions for each patient is easier to work with.
# prescription list formation
def prescription_list(row):
    """Return a list of all the prescription medications an individual is prescribed."""
    # pd.isna is a safer missing-value check than an identity test against np.nan
    if pd.isna(row['Prescriptions']):
        return np.nan
    return row['Prescriptions'].split(", ")

df['prescription_list'] = df.apply(prescription_list, axis=1)

# display
df.head()
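The splitting step can also be written as a small standalone helper that tolerates missing values and uneven spacing around commas. This is a minimal sketch rather than part of the original notebook: the name `parse_prescriptions` is hypothetical, and it assumes missing entries appear as `None` or a float NaN.

```python
import math


def parse_prescriptions(value):
    """Split a comma-separated prescription string into a list of drug names.

    Returns None when the value is missing (None or a float NaN).
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # Split on commas and strip surrounding whitespace, dropping empty pieces.
    return [name.strip() for name in value.split(",") if name.strip()]


print(parse_prescriptions("AMLODIPINE, LOSARTAN,SIMVASTATIN"))
print(parse_prescriptions(float("nan")))
```

A helper like this could be applied to the column with `df['Prescriptions'].apply(parse_prescriptions)`.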
Selenium is a browser automation library for Python. Here it drives a Google Chrome window, which we can control programmatically to carry out searches on our behalf. The driver is created as follows:
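from selenium.webdriver.chrome.service import Service

# Selenium 4 syntax: pass the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service("/usr/local/bin/chromedriver"))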
We can then instruct the driver to open a particular page. Here it loads the page for metformin (a common anti-diabetic medication) on the NHS website.
driver.get("https://www.nhs.uk/medicines/metformin/")
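The address passed to `driver.get` follows a simple pattern: the base path `https://www.nhs.uk/medicines/` followed by the drug name in lower case. A minimal sketch of building that address from a dataset entry is given here. `nhs_medicine_url` is a hypothetical helper, and joining multi-word names with hyphens is an assumption about the site's URL scheme, not something the notebook itself does.

```python
from urllib.parse import quote

NHS_MEDICINES_BASE = "https://www.nhs.uk/medicines/"


def nhs_medicine_url(drug):
    """Build the NHS medicines page URL for a drug name.

    Assumption: multi-word names are joined with hyphens in the URL.
    """
    slug = "-".join(drug.strip().lower().split())
    # quote() percent-encodes any characters that are not URL-safe.
    return f"{NHS_MEDICINES_BASE}{quote(slug)}/"


print(nhs_medicine_url("METFORMIN"))
```

The result can be passed straight to `driver.get`.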
You can see that Selenium has now opened the metformin page in the Chrome window it launched. Select the content you wish to extract, right-click it, choose Inspect, and copy the element's XPath from the browser's developer tools. The XPath points to the relevant HTML element, allowing you to access that section of the page programmatically. This process is known as scraping.
For instance, I want to fetch the introductory information about metformin from the NHS website. The XPath for that section is:
"""//*[@id="about-metformin"]/div"""
We can then retrieve it as follows:
# scrape
from selenium.webdriver.common.by import By

# find_element(By.XPATH, ...) replaces find_element_by_xpath, which Selenium 4 removed
metformin = driver.find_element(By.XPATH, """//*[@id="about-metformin"]/div""")
metformin.text

"Metformin is a medicine used to treat type 2 diabetes, and to help prevent type 2 diabetes if you're at high risk of developing it. \nMetformin is used when treating polycystic ovary syndrome (PCOS), although it's not officially approved for PCOS. \nType 2 diabetes is an illness where the body does not make enough insulin, or the insulin that it makes does not work properly. This can cause high blood sugar levels (hyperglycaemia). \nPCOS is a condition that affects how the ovaries work. \nMetformin lowers your blood sugar levels by improving the way your body handles insulin. \nIt's usually prescribed for diabetes when diet and exercise alone have not been enough to control your blood sugar levels. \nFor women with PCOS, metformin lowers insulin and blood sugar levels, and can also stimulate ovulation. \nMetformin is available on prescription as tablets and as a liquid that you drink."
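To see what this XPath expression selects, it can be tried on a small HTML fragment without a browser. The sketch below uses Python's standard-library `xml.etree.ElementTree`, which supports a limited XPath subset. The fragment is made up for illustration and is not the real NHS markup. The expression reads: find any element whose `id` is `about-metformin`, then take its direct `div` child.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment standing in for the real page.
PAGE = """
<html>
  <body>
    <section id="about-metformin">
      <h2>About metformin</h2>
      <div><p>Metformin is a medicine.</p><p>It is used for diabetes.</p></div>
    </section>
    <section id="key-facts">
      <div><p>Some key facts.</p></div>
    </section>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())
# ElementTree paths are relative, hence the leading "." before "//".
div = root.find(".//*[@id='about-metformin']/div")
text = " ".join(p.text for p in div.findall("p"))
print(text)
```

Only the `div` inside the matching section is returned; the heading and the `key-facts` section are left out.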
The returned text can then be cleaned up with regex and other modifications (such as replace functions).
metformin.text.replace("\n", " ")
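As a sketch of the regex-based cleanup mentioned above, the helper below collapses any run of whitespace (newlines, tabs, repeated spaces) into a single space. The name `clean_text` is hypothetical, and the sample string is a shortened stand-in for the scraped text. It uses only the standard-library `re` module.

```python
import re


def clean_text(raw):
    """Collapse all whitespace runs into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


sample = "Metformin is a medicine used to treat type 2 diabetes. \nIt is available on prescription.\n"
print(clean_text(sample))
```

Unlike a plain `replace("\n", " ")`, this avoids the double spaces left behind when a newline follows a trailing space.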
We can now apply the same approach to extract several sections of the NHS website for any drug. The nhs_details function below returns the drug details for a given prescribed medication in the dataset.
It's vital to note that this approach assumes that all pages share the same structure. As a result, if a page's HTML structure differs or changes, the XPath lookup may fail. To deal with this, I've included several fallbacks for locating drug information, using multiple try and except clauses. These were found through trial and error, as well as by knowing the basic structure of the NHS website.
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def nhs_details(drug):
    """Return the 'about', 'key facts' and 'who can take' sections for a drug."""
    drug = drug.lower()
    try:
        driver.get(f"https://www.nhs.uk/medicines/{drug}/")
        section_1 = driver.find_element(By.XPATH, f"""//*[@id="about-{drug}"]/div""")
        section_1_text = section_1.text.replace("\n", " ")
        section_2 = driver.find_element(By.XPATH, """//*[@id="key-facts"]/div""")
        section_2_text = section_2.text.replace("\n", " ")
        try:
            section_3 = driver.find_element(By.XPATH, f"""//*[@id="who-can-and-cannot-take-{drug}"]/div""")
            section_3_text = section_3.text.replace("\n", " ")
        except NoSuchElementException:
            # some pages use "cant" rather than "cannot" in the section id
            section_3 = driver.find_element(By.XPATH, f"""//*[@id="who-can-and-cant-take-{drug}"]/div""")
            section_3_text = section_3.text.replace("\n", " ")
        return (section_1_text, section_2_text, section_3_text)
    except NoSuchElementException:
        # fall back to the "-for-adults" version of the page
        driver.get(f"https://www.nhs.uk/medicines/{drug}-for-adults/")
        section_1 = driver.find_element(By.XPATH, f"""//*[@id="about-{drug}-for-adults"]/div""")
        section_1_text = section_1.text.replace("\n", " ")
        section_2 = driver.find_element(By.XPATH, """//*[@id="key-facts"]/div""")
        section_2_text = section_2.text.replace("\n", " ")
        section_3 = driver.find_element(By.XPATH, f"""//*[@id="who-can-and-cannot-take-{drug}"]/div""")
        section_3_text = section_3.text.replace("\n", " ")
        return (section_1_text, section_2_text, section_3_text)
nhs_details('SITAGLIPTIN')

('Sitagliptin is a medicine used to treat type 2 diabetes. Type 2 diabetes is an illness where the body does not make enough insulin, or the insulin that it makes does not work properly. This can cause high blood sugar levels (hyperglycaemia). Sitagliptin is prescribed for people who still have high blood sugar, even though they have a sensible diet and exercise regularly. Sitagliptin is only available on prescription. It comes as tablets that you swallow. It also comes as tablets containing a mixture of sitagliptin and metformin. Metformin is another drug used to treat diabetes.', "Sitagliptin works by increasing the amount of insulin that your body makes. Insulin is the hormone that controls sugar levels in your blood. You take sitagliptin once a day. The most common side effect of sitagliptin is headaches. This medicine does not usually make you put on weight. Sitagliptin is also called by the brand name Januvia. When combined with metformin it's called Janumet.", "Sitagliptin can be taken by adults (aged 18 years and older). Sitagliptin is not suitable for some people. To make sure it's safe for you, tell your doctor if you: have had an allergic reaction to sitagliptin or any other medicines in the past have problems with your pancreas have gallstones or very high levels of triglycerides (a type of fat) in your blood are a heavy drinker or dependent on alcohol have (or have previously had) any problems with your kidneys are pregnant or breastfeeding, or trying to get pregnant This medicine is not used to treat type 1 diabetes (when your body does not produce insulin).")
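The nested try/except logic in nhs_details can also be expressed as a list of candidate element ids that are tried in order. The sketch below shows the pattern with a stand-in lookup function so that it runs without a browser. `first_available`, `section_ids`, `fake_lookup` and the `LookupError` convention are all assumptions for illustration. With Selenium, the lookup would be the XPath search and the exception to catch would be `NoSuchElementException`.

```python
def first_available(candidates, lookup):
    """Return lookup(candidate) for the first candidate that does not fail.

    lookup is expected to raise LookupError when a candidate is missing.
    """
    for candidate in candidates:
        try:
            return lookup(candidate)
        except LookupError:
            continue  # try the next naming variant
    raise LookupError(f"none of the candidates matched: {candidates}")


def section_ids(drug):
    """Candidate ids for the 'who can take' section, as used in nhs_details."""
    return [
        f"who-can-and-cannot-take-{drug}",
        f"who-can-and-cant-take-{drug}",
    ]


# Stand-in for a page that only has the "cant" variant.
fake_page = {"who-can-and-cant-take-metformin": "Metformin can be taken by adults."}


def fake_lookup(element_id):
    return fake_page[element_id]  # KeyError is a subclass of LookupError


print(first_available(section_ids("metformin"), fake_lookup))
```

Adding a newly discovered naming variant then means appending one entry to the list instead of nesting another try/except block.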
Next, we define a function that prints the NHS website information for all of the drugs prescribed to a patient in the NHANES extract.
# build a function that returns information for all medication prescribed
def drug_information(patient_number):
    """Web-scrapes the NHS website and prints drug information for one patient."""
    drugs = df.loc[patient_number]['prescription_list']
    print(drugs)
    for drug in drugs:
        print('\nPrescription medication:', drug)
        print('\nAccessing NHS drug information')
        try:
            print(nhs_details(drug))
        except Exception:
            print('No NHS details available')

drug_information(0)

['AMLODIPINE', 'LOSARTAN', 'SIMVASTATIN']

Prescription medication: AMLODIPINE

Accessing NHS drug information
('Amlodipine is a medicine used to treat high blood pressure (hypertension). If you have high blood pressure, taking amlodipine helps prevent future heart disease, heart attacks and strokes. Amlodipine is also used to prevent chest pain caused by heart disease (angina). This medicine is only available on prescription. It comes as tablets or as a liquid to swallow. ', "Amlodipine lowers your blood pressure and makes it easier for your heart to pump blood around your body. It's usual to take amlodipine once a day. You can take it at any time of day, but try to make sure it's around the same time each day. The most common side effects include headache, flushing, feeling tired and swollen ankles. These usually improve after a few days. Amlodipine can be called amlodipine besilate, amlodipine maleate or amlodipine mesilate. This is because the medicine contains another chemical to make it easier for your body to take up and use it. It doesn't matter what your amlodipine is called. They all work as well as each other. Amlodipine is also called by the brand names Istin and Amlostin. ", "Amlodipine can be taken by adults and children aged 6 years and over. Amlodipine is not suitable for some people. To make sure amlodipine is safe for you, tell your doctor if you: have had an allergic reaction to amlodipine or any other medicines in the past are trying to get pregnant, are already pregnant or you're breastfeeding have liver or kidney disease have heart failure or you have recently had a heart attack")

Prescription medication: LOSARTAN

Accessing NHS drug information
("Losartan is a medicine widely used to treat high blood pressure and heart failure, and to protect your kidneys if you have both kidney disease and diabetes. Losartan helps to prevent future strokes, heart attacks and kidney problems. It also improves your survival if you're taking it for heart failure or after a heart attack. This medicine is only available on prescription. It comes as tablets. ", "Losartan lowers your blood pressure and makes it easier for your heart to pump blood around your body. It's often used as a second-choice treatment if you had to stop taking another blood pressure-lowering medicine because it gave you a dry, irritating cough. If you have diarrhoea and vomiting from a stomach bug or illness while taking losartan, tell your doctor. You may need to stop taking it until you feel better. The main side effects of losartan are dizziness and fatigue, but they're usually mild and shortlived. Losartan is not normally recommended in pregnancy or while breastfeeding. Talk to your doctor if you're trying to get pregnant, you're already pregnant or you're breastfeeding. Losartan is also called by the brand name Cozaar. ", "Losartan can be taken by adults aged 18 years and over. Children aged 6 years and older can take it, but only to treat high blood pressure. Your doctor may prescribe losartan if you've tried taking similar blood pressure-lowering medicines such as ramipril and lisinopril in the past, but had to stop taking them because of side effects such as a dry, irritating cough. Losartan isn't suitable for some people. To make sure losartan is safe for you, tell your doctor if you: have had an allergic reaction to losartan or other medicines in the past have diabetes have heart, liver or kidney problems recently had a kidney transplant have had diarrhoea or vomiting have been on a low salt diet have low blood pressure are trying to get pregnant, are already pregnant or you are breastfeeding")

Prescription medication: SIMVASTATIN

Accessing NHS drug information
("Simvastatin belongs to a group of medicines called statins. It's used to lower cholesterol if you've been diagnosed with high blood cholesterol. It's also taken to prevent heart disease, including heart attacks and strokes. Your doctor may prescribe simvastatin if you have a family history of heart disease, or a long-term health condition such as rheumatoid arthritis, or type 1 or type 2 diabetes. The medicine is available on prescription as tablets. You can also buy a low-strength 10mg tablet from a pharmacy. ", "Simvastatin seems to be a very safe medicine. It's unusual to have any side effects. Keep taking simvastatin even if you feel well, as you will still be getting the benefits. Most people with high cholesterol don't have any symptoms. Do not take simvastatin if you're pregnant, trying to get pregnant or breastfeeding. Do not drink grapefruit juice while you're taking simvastatin. It doesn't mix well with this medicine. Simvastatin is also called Zocor and Simvador. ", "Simvastatin can be taken by adults and children over the age of 10 years. Simvastatin isn't suitable for some people. Tell your doctor if you: have had an allergic reaction to simvastatin or any other medicines in the past have liver or kidney problems are trying to get pregnant, think you might be pregnant, you're already pregnant, or you're breastfeeding have severe lung disease regularly drink large amounts of alcohol have an underactive thyroid have, or have had, a muscle disorder (including fibromyalgia)")
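Many patients share the same prescriptions, so looking up drug details once per patient repeats work and sends the same request to the website again and again. One option, sketched below, is to cache results by drug name and pause briefly after each new request. `make_cached_fetcher` and its `delay` parameter are hypothetical, and a stand-in fetch function is used here so the example runs offline. In the notebook, nhs_details would be passed in instead.

```python
import time


def make_cached_fetcher(fetch, delay=0.0):
    """Wrap fetch so each drug is fetched once; failures are cached as None."""
    cache = {}

    def cached(drug):
        key = drug.lower()
        if key not in cache:
            try:
                cache[key] = fetch(key)
            except Exception:
                cache[key] = None  # remember that no details were available
            time.sleep(delay)  # pause only after a real fetch
        return cache[key]

    return cached


calls = []


def fake_fetch(drug):
    """Stand-in for nhs_details that records how often it is called."""
    calls.append(drug)
    return f"details for {drug}"


get_details = make_cached_fetcher(fake_fetch)
patients = [["AMLODIPINE", "LOSARTAN"], ["LOSARTAN", "SIMVASTATIN"]]
results = {drug: get_details(drug) for drugs in patients for drug in drugs}
print(len(calls))
```

Here the two patients have four prescriptions between them, but only three distinct drugs are fetched. A non-zero `delay` (for example one second) keeps the scraper from sending requests to the site too quickly.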
This post showed how to build a simple web scraping tool for retrieving drug information from the NHS website. I've used the same template in other scenarios, such as determining the chromosomal positions of 49K gene mutations (single nucleotide polymorphisms).
When data is difficult to collect in clean, user-friendly formats, web scraping is a useful tool to have on hand.