
Using BeautifulSoup To Extract A Table In Python 3

I would like to use BeautifulSoup to extract a table from a website and store it as structured data. The final output I require is something that can be exported to a .csv with a h

Solution 1:

So you already have this:

datasets = [
  (('Tests', '103'), ('Failures', '24'), ('Success Rate', '76.70%'), ('Average Time', '71 ms'), ('Min Time', '0 ms'), ('Max Time', '829 ms')), 
  (('Tests', '109'), ('Failures', '35'), ('Success Rate', '82.01%'), ('Average Time', '12 ms'), ('Min Time', '2 ms'), ('Max Time', '923 ms'))
]

Here's how you can transform it. Assuming every row has the same headers in the same order, you can extract the headers from the first row:

headers_row = [hdr for hdr, data in datasets[0]]

Next, extract the second field of each (header, value) tuple, such as '103' from ('Tests', '103'), in every row:

processed_rows = [
  [data for hdr, data in row]
  for row in datasets
]
# [['103', '24', '76.70%', '71 ms', '0 ms', '829 ms'], ['109', '35', '82.01%', '12 ms', '2 ms', '923 ms']]

Now you have the header row and a list of processed_rows. You can write them to a CSV file with the standard csv module.
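
As a minimal sketch of that last step, assuming the headers_row and processed_rows values built above (the helper name write_table is hypothetical and not part of the original answer):

```python
import csv

headers_row = ['Tests', 'Failures', 'Success Rate', 'Average Time', 'Min Time', 'Max Time']
processed_rows = [
    ['103', '24', '76.70%', '71 ms', '0 ms', '829 ms'],
    ['109', '35', '82.01%', '12 ms', '2 ms', '923 ms'],
]

def write_table(path, header, rows):
    """Write one header row followed by all data rows to a CSV file."""
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(header)   # a single row
        writer.writerows(rows)    # every remaining row in one call
```

Calling write_table('data.csv', headers_row, processed_rows) should then produce a file with the header line followed by one line per row.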


A better solution may be to keep your original format and use csv.DictWriter.

  1. Extract the headers into headers_row, as shown above.

  2. Write the data:

    import csv
    
    with open('data.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers_row)
    
        writer.writeheader()
    
        for row in datasets: # your original data
            writer.writerow(dict(row))
    

Here dict(datasets[0]), for example, is:

{'Tests': '103', 'Failures': '24', 'Success Rate': '76.70%', 'Average Time': '71 ms', 'Min Time': '0 ms', 'Max Time': '829 ms'}

Solution 2:

In Python 3, zip returns an iterator rather than a list, so convert it to a list before appending it to datasets:

for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(list(dataset))  # process iterator to list

print(datasets)

Final Output:

[[('Tests', '103'), 
('Failures', '24'), 
('Success Rate', '76.70%'), 
('Average Time', '71 ms'), 
('Min Time', '0 ms'), 
('Max Time', '829 ms')], 

[('Tests', '109'), 
('Failures', '35'), 
('Success Rate', '82.01%'), 
('Average Time', '12 ms'), 
('Min Time', '2 ms'), 
('Max Time', '923 ms')]]

If you want to convert the dataset to a csv string, use this code:

# convert to csv string

hdrline = ','.join(e[0] for e in datasets[0]) + "\n"
data = ""
for rw in datasets:
    data += ','.join([e[1] for e in rw]) + "\n"
    
csvstr = hdrline + data

print(csvstr)

Output:

Tests,Failures,Success Rate,Average Time,Min Time,Max Time
103,24,76.70%,71 ms,0 ms,829 ms
109,35,82.01%,12 ms,2 ms,923 ms
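
Joining fields with ',' works for the values shown here, but it would produce a malformed line if a cell ever contained a comma, a quote character, or a newline. One possible alternative is to let the csv module build the string in memory through io.StringIO. This is only a sketch: the function name to_csv_string and the sample rows with a 'Note' column are illustrative assumptions, not data from the question.

```python
import csv
import io

def to_csv_string(datasets):
    """Build CSV text from rows of (header, value) pairs."""
    buffer = io.StringIO()
    # lineterminator='\n' matches the "\n" used in the hand-written version
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(hdr for hdr, _ in datasets[0])   # headers from the first row
    for row in datasets:
        writer.writerow(value for _, value in row)   # values only
    return buffer.getvalue()

# Hypothetical sample: the second field contains a comma on purpose
sample = [
    [('Tests', '103'), ('Note', 'slow, flaky')],
    [('Tests', '109'), ('Note', 'ok')],
]
csvstr = to_csv_string(sample)
print(csvstr)
```

With this approach the field containing a comma is wrapped in double quotes by the writer, so the result can still be parsed back into two columns.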

Solution 3:

If you are using the standard csv module, you don't need to pair each value with its label. Write the header row once, then write each row of cell values.

You can do the following with a writer object created by csv.writer (see https://docs.python.org/3.8/library/csv.html#csv.writer):

import csv
...

with open('file.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile) # You may add options here to format your csv file as needed

    headings = [th.get_text() for th in table.find("tr").find_all("th")]

    csvwriter.writerow(headings)

    for row in table.find_all("tr")[1:]:
        data = (td.get_text() for td in row.find_all("td"))
        csvwriter.writerow(data)
