Efficient Table Parsing from Wikipedia with Python and BeautifulSoup
To parse tables from Wikipedia more reliably, we'll work around the issues with pd.read_html() mentioned in the question by extracting the rows ourselves with BeautifulSoup. Here's a revised version of the code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
def parse_wikipedia_table(url):
    # Fetch the webpage and build the DOM
    res = requests.get(url, timeout=10)
    res.raise_for_status()
    tree = BeautifulSoup(res.text, 'html.parser')

    # Find the first table with class "wikitable"
    wikitable = tree.find('table', class_='wikitable')

    # If no table is found, return None
    if not wikitable:
        return None

    # Extract the cell text row by row, keeping the header row
    # so the caller can use it for column names
    data = []
    for row in wikitable.find_all('tr'):
        cols = row.find_all(['th', 'td'])
        col_data = [col.text.strip() for col in cols]
        # Pad short rows so every row has the same width
        if len(col_data) == 2:
            col_data.append(None)
        data.append(col_data)
    return data
def main():
    url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
    data = parse_wikipedia_table(url)
    if data is None:
        print('No wikitable found on the page.')
        return

    # The first row holds the headers; the rest are data rows
    df = pd.DataFrame(data[1:], columns=data[0])
    print(df)

if __name__ == '__main__':
    main()
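Note that find() returns only the first matching table, and Wikipedia articles often contain several. If you need all of them, something along these lines works; this is a minimal sketch, and the helper name parse_all_wikipedia_tables is illustrative rather than from the original question:

def parse_all_wikipedia_tables(url):
    # Fetch the page once and parse every table with class "wikitable"
    res = requests.get(url, timeout=10)
    res.raise_for_status()
    tree = BeautifulSoup(res.text, 'html.parser')

    tables = []
    for wikitable in tree.find_all('table', class_='wikitable'):
        rows = []
        for row in wikitable.find_all('tr'):
            rows.append([cell.text.strip() for cell in row.find_all(['th', 'td'])])
        tables.append(rows)
    return tables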
This revised code fetches the webpage, builds a DOM with BeautifulSoup, finds the first table with class wikitable, extracts each row's cell text (including the header row), and converts the result to a pandas DataFrame. It sidesteps the pd.read_html() issues from the question and gives you direct control over how ragged rows are handled.
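For comparison, when the table is well-formed, pd.read_html() can fetch and parse in one call. A minimal sketch, assuming lxml or html5lib is installed and that the site accepts the default User-Agent:

# read_html returns a list of DataFrames, one per matching table;
# attrs filters to tables with class "wikitable"
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
tables = pd.read_html(url, attrs={'class': 'wikitable'})
print(tables[0].head())

The manual BeautifulSoup approach above remains useful precisely where read_html() struggles, such as ragged rows or nested markup.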
Last modified on 2023-11-30