Extracting tabular data from multiple xlsx files with inconsistent structures

Question

I have several Excel spreadsheets that each contain a mix of information, including tables. I need to extract one specific table from every file. I thought of using the skiprows argument of pandas' read_excel, but the table starts and ends on a different row in each sheet. In my example sheet, I need to retrieve the last table, the one with the headers "Well", "Content", etc., and convert it to a DataFrame. In this example the table starts at row 115, but that varies from file to file. The number of rows between the end of the table and the end of the sheet also varies. Any help is much appreciated!


I have looked into openpyxl but didn't find anything that would isolate a table based on its header values. I have also looked into the skiprows argument of pd.read_excel and into indexing the DataFrame with iloc. The problem with both is that the table's position and size vary from file to file.

narikkadan · Answer 1 · Mar 25, 2023

I was able to solve this by relying on a cell that is consistent across all files: in my case, "Basic settings" is always the same distance below the bottom of the table. I first get the index of the table's first row (the header row, which starts with "Well"), and then the index of that consistent row below the table. As follows:

import pandas as pd

# Assumption: the sheet was read without a header row, so columns are labelled 0, 1, 2, ...
# "file.xlsx" is a placeholder path.
raw_table = pd.read_excel("file.xlsx", header=None)

# Header index: the first row whose first column is 'Well'
header_index = raw_table[raw_table[0].eq('Well')].index.values[0]
# Footer index: the first row whose first column is the consistent title 'Basic settings'
footer_index = raw_table[raw_table[0].eq('Basic settings')].index.values[0]
# Slice the table between the two indices; the footer sits 3 rows
# below the end of the table, so subtract 3
cropped_table = raw_table[header_index:footer_index-3]
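
To apply the same idea to many files, the slicing above could be wrapped in a small helper. This is only a sketch under assumptions: `extract_table` and `extract_from_files` are hypothetical names (not part of pandas), the labels are expected in the first column, and the footer marker is assumed to sit a fixed number of rows below the table (3 here, as in my files). Unlike the snippet above, it also promotes the "Well" row to column names and only accepts a footer marker found below the header.

```python
import pandas as pd


def extract_table(raw, header_label="Well", footer_label="Basic settings", footer_offset=3):
    """Crop the block between a header row and a footer marker found in the first column."""
    first_col = raw.iloc[:, 0]

    # Positions of rows whose first cell equals the header label
    header_rows = first_col.eq(header_label).to_numpy().nonzero()[0]
    if len(header_rows) == 0:
        raise ValueError(f"no row starts with {header_label!r}")
    header_pos = int(header_rows[0])

    # First footer marker that appears below the header row
    footer_rows = first_col.eq(footer_label).to_numpy().nonzero()[0]
    footer_rows = footer_rows[footer_rows > header_pos]
    if len(footer_rows) == 0:
        raise ValueError(f"no row starts with {footer_label!r} below the header")
    end_pos = int(footer_rows[0]) - footer_offset

    # Data rows sit between the header row and end_pos (exclusive)
    table = raw.iloc[header_pos + 1:end_pos].copy()
    table.columns = raw.iloc[header_pos].tolist()
    return table.reset_index(drop=True)


def extract_from_files(paths, **kwargs):
    """Read the first sheet of each workbook without a header and crop its table."""
    return {p: extract_table(pd.read_excel(p, header=None), **kwargs) for p in paths}


# Small in-memory stand-in for a sheet (made-up values, for illustration only)
demo = pd.DataFrame([
    ["Run report", None],
    [None, None],
    ["Well", "Content"],
    ["A1", "Sample"],
    ["A2", "Blank"],
    [None, None],
    [None, None],
    [None, None],
    ["Basic settings", None],
])
table = extract_table(demo)
print(table)
```

If the distance between the table and the footer marker differs in your files, pass a different `footer_offset`; if the marker is missing, the helper raises a `ValueError` instead of silently returning the wrong rows.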