Extracting Table Data from Word Document using Aspose Words

For one of my projects, I had a requirement to extract data from Word documents and export it to a database. The biggest challenge was that I had to support the existing Word documents: there were thousands of them, all in the same format, each containing chunks of data. This document format was never designed to be read by another system, which meant no bookmarks, merge fields, or styles to distinguish the actual data from the standard instructions. Luckily, all the input fields were in tables. But these tables came in different formats, some with a single row or cell and some with a varying number of rows and cells.

I use Aspose Words extensively for creating and manipulating Word documents, and given my experience with the component, I decided to go with it. To solve the problem, I created a matching table model in C# that I could use later while reading the documents.

Below, you can see I created a class called WordDocumentTable with three properties: TableID, RowID and ColumnID. As explained earlier, the documents carry no table or row identifiers, so these properties simply denote positions within the Word document. The start index is assumed to be 0.

public class WordDocumentTable
{
	// Addresses a whole table by its position in the document.
	public WordDocumentTable(int tableID)
	{
		TableID = tableID;
	}

	// Addresses a column in a table that has only one row.
	public WordDocumentTable(int tableID, int columnID)
	{
		TableID = tableID;
		ColumnID = columnID;
	}

	// Addresses a single cell. Note the parameter order: column first, then row.
	public WordDocumentTable(int tableID, int columnID, int rowID)
	{
		TableID = tableID;
		ColumnID = columnID;
		RowID = rowID;
	}

	// Zero-based positions. RowID and ColumnID stay 0 when not supplied.
	public int TableID { get; set; }

	public int RowID { get; set; }

	public int ColumnID { get; set; }
}

Now comes the extraction part. Below is the collection of table cells that I want to read from the document.

private List<WordDocumentTable> WordDocumentTables
{  
	get  
	{    
		List<WordDocumentTable> wordDocTable = new List<WordDocumentTable>();
		// Reads the data from the first table of the document.
		wordDocTable.Add(new WordDocumentTable(0));
		// Reads the data from the second table, second column. This table has only one row.
		wordDocTable.Add(new WordDocumentTable(1, 1));
		// Reads the data from the third table, second row, second cell.
		wordDocTable.Add(new WordDocumentTable(2, 1, 1));
		return wordDocTable;
	}
}

Below is the method which extracts the data from the Aspose Words document based on the table, row and cell.

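// Requires: using System; using System.IO; using Aspose.Words; using Aspose.Words.Tables;
public void ExtractTableData(byte[] PobjData)
{
	using (MemoryStream stream = new MemoryStream(PobjData))
	{
		Document asposeDocument = new Document(stream);
		foreach (WordDocumentTable wordDocTable in WordDocumentTables)
		{
			// Tables are located by their zero-based position in the document.
			Aspose.Words.Tables.Table table = (Aspose.Words.Tables.Table)asposeDocument.GetChild(NodeType.Table, wordDocTable.TableID, true);
			if (table == null)
			{
				// The document has fewer tables than expected, so skip this entry.
				continue;
			}

			// Default: the text of the whole table.
			string cellData = table.Range.Text;

			if (wordDocTable.ColumnID > 0)
			{
				if (wordDocTable.RowID == 0)
				{
					// Single-row table: index directly into the table's cells.
					// ToTxt() comes from older Aspose.Words releases; in newer ones
					// the equivalent should be ToString(SaveFormat.Text).
					NodeCollection cells = table.GetChildNodes(NodeType.Cell, true);
					cellData = cells[wordDocTable.ColumnID].ToTxt();
				}
				else
				{
					// Multi-row table: pick the row first, then the cell within it.
					NodeCollection rows = table.GetChildNodes(NodeType.Row, true);
					cellData = ((Row)rows[wordDocTable.RowID]).Cells[wordDocTable.ColumnID].ToTxt();
				}
			}

			Console.WriteLine(String.Format("Data in Table {0}, Row {1}, Column {2} : {3}",
									wordDocTable.TableID,
									wordDocTable.RowID,
									wordDocTable.ColumnID,
									cellData));
		}
	}
}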
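
To try out the addressing rules without an Aspose licence, the three cases handled in ExtractTableData can be mirrored on plain string arrays. This is only a rough sketch: the TableLookupSketch class, its Resolve method, the string[][][] stand-in for a document and the " | " separator are all assumptions made for illustration and are not part of Aspose Words.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for a document: tables[t][r][c] holds the text of one cell.
public static class TableLookupSketch
{
    // Mirrors the rules used in ExtractTableData:
    //   columnId <= 0            -> text of the whole table
    //   columnId > 0, rowId == 0 -> cell at columnId, counting cells across the table
    //   columnId > 0, otherwise  -> cell at [rowId][columnId]
    public static string Resolve(string[][][] tables, int tableId, int rowId, int columnId)
    {
        string[][] table = tables[tableId];

        if (columnId <= 0)
        {
            // Join every cell of every row; the separator is an arbitrary choice for this sketch.
            return string.Join(" | ", table.SelectMany(row => row));
        }

        if (rowId == 0)
        {
            // Flatten the rows so the index counts cells in document order.
            return table.SelectMany(row => row).ElementAt(columnId);
        }

        return table[rowId][columnId];
    }
}
```

For example, if the third table is modelled as the two rows { "Name", "Alice" } and { "Age", "30" }, then Resolve(tables, 2, 1, 1) picks the second cell of the second row, "30". The sketch also makes one consequence of the design visible: because 0 doubles as "not specified", the first column of a table cannot be addressed on its own.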