Add tags to existing PDF
This article demonstrates how to open an existing PDF document, read its visual content, and tag that content based on its type and text. In particular, a text shape is tagged as H1 if its text equals “Creating tagged PDF” and as Span otherwise. Image shapes are tagged as Figure. This code sample is included in the evaluation download.
Here is the end result:
First we open an existing PDF document, and extract the visual content as shapes:
using (FileStream fs = new FileStream("NotTagged.pdf", FileMode.Open))
{
    Document document = new Document(fs);
    Page sourcePage = document.Pages[0];
    ShapeCollection shapes = sourcePage.CreateShapes();
    ...
}
Next we create a new tagged document and set up the root hierarchy:
// create a new tagged document and set up the root hierarchy
Document taggedDocument = new Document();
taggedDocument.LogicalStructure = new LogicalStructure();
Tag documentTag = new Tag("Document", taggedDocument.LogicalStructure.RootTag);
Tag paragraphTag = new Tag("P", documentTag);
// copy the visual content to the new document
taggedDocument.Pages.Add(new Page(sourcePage.Width, sourcePage.Height));
taggedDocument.Pages[0].Overlay.Add(shapes);
SetTag(shapes, paragraphTag);
The SetTag method tags a single text or image shape according to its type, and calls itself recursively for every item in a shape collection:
static void SetTag(Shape shape, Tag paragraphTag)
{
    // tag text
    TextShape textShape = shape as TextShape;
    if (null != textShape)
    {
        if (textShape.Text == "Creating tagged PDF")
            textShape.ParentTag = new Tag("H1", paragraphTag);
        else
            textShape.ParentTag = new Tag("Span", paragraphTag);
        return;
    }
    // tag images
    ImageShape imageShape = shape as ImageShape;
    if (null != imageShape)
    {
        imageShape.ParentTag = new Tag("Figure", paragraphTag);
        return;
    }
    // recurse into nested shape collections
    ShapeCollection shapeCollection = shape as ShapeCollection;
    if (null != shapeCollection)
    {
        foreach (Shape item in shapeCollection)
        {
            SetTag(item, paragraphTag);
        }
    }
}
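To see which tag structure this recursion produces without a PDF at hand, the traversal can be tried on a small in-memory model. The following is only a sketch: the Tag, Shape, TextShape, ImageShape and ShapeCollection classes below are hypothetical stand-ins written for this illustration, not the library's own types, and the TaggingSketch class and its Demo method are likewise made up. Only the decision logic mirrors SetTag above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; NOT the library's types.
class Tag
{
    public string Name { get; }
    public List<Tag> Children { get; } = new List<Tag>();

    // Creating a tag with a parent appends it to that parent's children.
    public Tag(string name, Tag parent)
    {
        Name = name;
        if (parent != null) parent.Children.Add(this);
    }
}

abstract class Shape { public Tag ParentTag { get; set; } }
class TextShape : Shape { public string Text { get; set; } }
class ImageShape : Shape { }
class ShapeCollection : Shape
{
    public List<Shape> Items { get; } = new List<Shape>();
}

static class TaggingSketch
{
    // Same decisions as SetTag above, written with pattern matching.
    public static void SetTag(Shape shape, Tag paragraphTag)
    {
        switch (shape)
        {
            case TextShape text:
                string name = text.Text == "Creating tagged PDF" ? "H1" : "Span";
                text.ParentTag = new Tag(name, paragraphTag);
                break;
            case ImageShape image:
                image.ParentTag = new Tag("Figure", paragraphTag);
                break;
            case ShapeCollection group:
                foreach (Shape item in group.Items) SetTag(item, paragraphTag);
                break;
        }
    }

    // Builds a heading followed by a nested group (body text plus image),
    // tags everything, and returns the child tag names in creation order.
    public static string Demo()
    {
        var nested = new ShapeCollection();
        nested.Items.Add(new TextShape { Text = "Some body text" });
        nested.Items.Add(new ImageShape());

        var root = new ShapeCollection();
        root.Items.Add(new TextShape { Text = "Creating tagged PDF" });
        root.Items.Add(nested);

        var paragraph = new Tag("P", null);
        SetTag(root, paragraph);

        var names = new List<string>();
        foreach (Tag child in paragraph.Children) names.Add(child.Name);
        return string.Join(",", names);
    }
}
```

In this model, TaggingSketch.Demo() returns "H1,Span,Figure": the nested group contributes no tag of its own, so every leaf shape ends up as a direct child of the single P tag, in traversal order.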