- Add Long Term Validation (LTV) data to an existing signature
- How to downscale all images in a PDF
- How to generate and export certificates
- How to downscale all images in a PDF
- Add Stamp to PDF
- How to reduce PDF file size
- Highlight fields in PDF
- Add a note to PDF
- Resize PDF pages
- Extract glyph boxes from PDF
- Use TrueType font collections
- Layout text with MultilineTextShape
- Merge PDF files in C# .NET
- How do I extract page destinations from bookmarks?
- Clip PDF page content in C#
- Fill PDF form
- Extract glyphs and sort by reading order
- Add bookmarks to PDF
- How to scale content of PDF
- Create rectangles with rounded corners
- Create text with decorations
- TIFF to PDF C#
- Crop content on a PDF page
- How to embed files in a PDF document
- Remove graphics from PDF
- Change the color inside a PDF
- Import FDF into PDF
- Flatten PDF form
- Digitally sign a PDF form in C# or VB.NET
- Vector graphics in PDF
- Translate PDF page content
- Extract graphics from PDF
- Determine the content bounding box
- Search text in PDF
- Convert PDF to plain text
- Flatten Markup Annotation
- Add text field to PDF
- Extract embedded files from PDF
- Extract images from PDF
- Add a Diagonal Watermark to PDF in C#
- Fit image to PDF page
- Add simple html text to PDF
- Add multiline text to a PDF document
- Add single-line text to PDF
- Create a new digitally signed PDF document
- PDF Viewer Preferences
- Change page orientation PDF
- Split PDF pages in C# and VB.NET
- Append two or more existing PDF files
- Determine if a PDF only contains images
- Add footer to PDF
- Convert SVG to PDF
- Add hyperlink to PDF
- Rotate a PDF page
- Change the formatting of a numeric field
- How to mirror PDF pages and other shapes
- Fill in a template PDF document
- How to add autosized text to PDF
- Create formfields in PDF documents
- Export FDF from PDF form
- Add a link with an internal destination to PDF
- Remove PDF security settings
- Add a link to PDF with an external destination
- How to sign and verify updates to a PDF document
- Embed TrueType font
- How to create a tiling for shapes in PDF
- EMF to PDF as vector image
- Replace field with image
- Add a rubber stamp annotation with a custom icon
- Create a text annotation in PDF with rich text
- Read and write meta data from PDF
- Create a custom signature handler to sign and verify PDF documents
- Download and convert image to PDF
- Add barcodes to PDF
- Tagged PDF
- PDFKit.NET 5.0 - detailed changes to the API
- Create tagged PDF
- PDFKit.NET 5.0 and .NET Core
- PDFKit.NET 5.0 and Xamarin
- Dynamic XFA
- PDFKit.NET 5.0 .NET Standard API
- .NET Core console app on MacOS
- Add tags to existing PDF
- Read PDF tags
- Merge XDP data with dynamic XFA form
- Fill XFA form and export XDP data
- Fill and save dynamic XFA form
- Use PDFKit.NET 5.0 with a Xamarin.Forms app
- Use multiple licenses
- Licensing and .NET Standard
- Reduce PDF size
- Disable submit button after submitting
- Write Document to HttpResponse
Extract glyphs and sort by reading order
The PDF format is designed to be a finished document after all the editing is done. It is intended for printing, viewing and interacting (such as clicking links or filling fields), but not for further editing. However the extraction of text is possible, although with a few caveats, which this article will address.
Text extraction: default ordering
The primary purpose of PDF is the presentation of information to humans and not to provide computer readable information. Most PDF documents do not contain layout information such as columns, tables and paragraphs. The textual content is basically just a collection of glyphs, each with its own X and Y coordinates. To extract text this collection must be ordered somehow, usually by sorting the glyphs on their X and Y position. This assumes that the glyphs on the top-left are the first and on the bottom right are the last in the reading order. This is how PDFKit.NET orders the glyphs as returned by page.Glyphs by default:
Document document = new Document(fileIn);
Page page = document.Pages[0];
GlyphCollection glyphs = page.Glyphs;
glyphs.Sort();
foreach (Glyph glyph in glyphs)
{
// glyphs sorted from top-left to bottom right here...
}
Dim document As New Document(fileIn)
Dim page As Page = document.Pages(0)
Dim glyphs As GlyphCollection = page.Glyphs
glyphs.Sort()
' glyphs sorted from top-left to bottom right here...
For Each glyph As Glyph In glyphs
Next
In most cases, this gives the expected result. This fails however in case of superscript text like this:
The problem here is that in both the text ‘abc’ and ‘xyz’ are sorted first on their vertical position. So what we get as result is:
Although ‘abc’ comes after ‘superscript’ in the reading order, it is returned first.
Text extraction: custom ordering
To address this problem we can write a custom sorter that is used this way:
Document document = new Document(fileIn);
Page page = document.Pages[0];
GlyphCollection glyphs = page.Glyphs;
var glyphSorter = new SortGlyphs();
glyphs.Sort(glyphSorter, true);
…
Dim document As New Document(fileIn)
Dim page As Page = document.Pages(0)
Dim glyphs As GlyphCollection = page.Glyphs
Dim glyphSorter = New SortGlyphs()
glyphs.Sort(glyphSorter, True)
The implementation of SortGlyphs:
internal class SortGlyphs : IGlyphComparer
{
private int minimalPercentageForGlyphsToOverlapVertically = 40; // in percents
public struct glyphArea
{
public float X;
public float Y;
public float Width;
public float Height;
public bool HasIllegalValue;
}
// Find out whether glayp A or B should come first in the output
public int Compare(Glyph glyphA, Glyph glyphB)
{
var areaA = getGlyphArea(glyphA);
var areaB = getGlyphArea(glyphB);
// default compare result:
// If nothing special and not about on the same Y position, return:
// 1 if glyph A is positioned higher than glyph B, meaning A is first
// -1 if glyph A is below glyph B
var compare = (areaA.Y < areaB.Y) ? 1 : -1; // default: -1 = a < b, 1 = a > b
// Below, we account for characters with a negative height. Usually, this
// means that the characters are upside down. Just make sure that these characters
// are considered smaller than the rest, and equal amongst themselves.
if (areaA.HasIllegalValue)
{
compare = (areaB.HasIllegalValue) ? 0 : -1;
}
else if (areaA.HasIllegalValue)
{
compare = 1;
}
else
{
// If here, the glyphs can be compared by their X and Y position,
// First look if the glyphs are about on the same height
// by checking if there is any overlap...
var yOverlapTop = Math.Min(areaA.Y + areaA.Height, areaB.Y + areaB.Height);
var yOverlapBottom = Math.Max(areaA.Y, areaB.Y);
if (yOverlapTop > yOverlapBottom)
{
// Found that A and B have at least a bit of a vertical overlap:
// Now check if there is a a minimal amount to overlap or that they are just
// touching each other.
// Calculate the overlap in percents of the shortest one, in this way it is
// more likely to match the sub- and superscript glyphs with the rest of the line
var yOverlap = yOverlapTop - yOverlapBottom;
var yOverlapPercent = 100.00 * (yOverlap / Math.Min(areaA.Height, areaB.Height));
if (yOverlapPercent > this.minimalPercentageForGlyphsToOverlapVertically)
{
// Overlapping enough: take a look at the x-position and mark the most
// left sided first
compare = (areaA.X > areaB.X) ? 1 : // a > b
(areaA.X < areaB.X) ? -1 : // a < b
(areaA.Width > areaB.Width) ? 1 : // a > b
(areaA.Width < areaB.Width) ? -1 : // a < b
0 ; // a == b
}
}
}
return compare;
}
// Get the area around the glyph, possibly changed for special characters
private static glyphArea getGlyphArea(Glyph glyph)
{
var area = new glyphArea()
{
X = glyph.BottomLeft.X,
Y = glyph.BottomLeft.Y,
Width = glyph.TopRight.X - glyph.TopLeft.X,
Height = glyph.TopRight.Y - glyph.BottomRight.Y,
HasIllegalValue = false
};
area.HasIllegalValue = (area.Height < 0) || (area.Width < 0);
// This gives a possibily to tweak the dimensions of some special characters
// like the underscore '_'.
if (glyph.Characters.Length == 1)
{
switch (glyph.Characters[0])
{
case '_': //underscore
float dY = (float)(glyph.TopLeft.Y - glyph.BottomLeft.Y);
dY *= 0.4f; //the correction will be 40%, so 60% (height) will be left.
break;
}
}
return area;
}
}
Friend Class SortGlyphs
Implements IGlyphComparer
Private minimalPercentageForGlyphsToOverlapVertically As Integer = 40
' in percents
Public Structure glyphArea
Public X As Single
Public Y As Single
Public Width As Single
Public Height As Single
Public HasIllegalValue As Boolean
End Structure
' Find out whether glayp A or B should come first in the output
Public Function Compare(glyphA As Glyph, glyphB As Glyph) As Integer Implements IGlyphComparer.Compare
Dim areaA = getGlyphArea(glyphA)
Dim areaB = getGlyphArea(glyphB)
' default compare result:
' If nothing special and not about on the same Y position, return:
' 1 if glyph A is positioned higher than glyph B, meaning A is first
' -1 if glyph A is below glyph B
Dim compare__1 = If((areaA.Y < areaB.Y), 1, -1)
' default: -1 = a < b, 1 = a > b
' Below, we account for characters with a negative height. Usually, this
' means that the characters are upside down. Just make sure that these characters
' are considered smaller than the rest, and equal amongst themselves.
If areaA.HasIllegalValue Then
compare__1 = If((areaB.HasIllegalValue), 0, -1)
ElseIf areaA.HasIllegalValue Then
compare__1 = 1
Else
' If here, the glyphs can be compared by their X and Y position,
' First look if the glyphs are about on the same height
' by checking if there is any overlap...
Dim yOverlapTop = Math.Min(areaA.Y + areaA.Height, areaB.Y + areaB.Height)
Dim yOverlapBottom = Math.Max(areaA.Y, areaB.Y)
If yOverlapTop > yOverlapBottom Then
' Found that A and B have at least a bit of a vertical overlap:
' Now check if there is a a minimal amount to overlap or that they are just
' touching each other.
' Calculate the overlap in percents of the shortest one, in this way it is
' more likely to match the sub- and superscript glyphs with the rest of the line
Dim yOverlap = yOverlapTop - yOverlapBottom
Dim yOverlapPercent = 100.0 * (yOverlap / Math.Min(areaA.Height, areaB.Height))
If yOverlapPercent > Me.minimalPercentageForGlyphsToOverlapVertically Then
' Overlapping enough: take a look at the x-position and mark the most
' left sided first
' a > b
' a < b
' a > b
' a < b
' a == b
compare__1 = If((areaA.X > areaB.X), 1, If((areaA.X < areaB.X), -1, If((areaA.Width > areaB.Width), 1, If((areaA.Width < areaB.Width), -1, 0))))
End If
End If
End If
Return compare__1
End Function
' Get the area around the glyph, possibly changed for special characters
Private Shared Function getGlyphArea(glyph As Glyph) As glyphArea
Dim area = New glyphArea() With {
.X = glyph.BottomLeft.X,
.Y = glyph.BottomLeft.Y,
.Width = glyph.TopRight.X - glyph.TopLeft.X,
.Height = glyph.TopRight.Y - glyph.BottomRight.Y,
.HasIllegalValue = False
}
area.HasIllegalValue = (area.Height < 0) OrElse (area.Width < 0)
' This gives a possibily to tweak the dimensions of some special characters
' like the underscore '_'.
If glyph.Characters.Length = 1 Then
Select Case glyph.Characters(0)
Case "_"c
'underscore
Dim dY As Single = CSng(glyph.TopLeft.Y - glyph.BottomLeft.Y)
dY *= 0.4F
'the correction will be 40%, so 60% (height) will be left.
Exit Select
End Select
End If
Return area
End Function
End Class
This sorter considers glyphs to be on the same line if they overlap 40% or more.
Multiple columns
As stated beFore, the text in a PDF document is just an unordered collection of glyphs. Extracting the text becomes much more difficult if the layout of the PDF is not ordered in a very simple way, e.g. the PDF can have multi columns. We do not try to solve these kind of problems, but we help how to recognize columns.

We have sent an email with a download link.