Convert PDF to plain text
The following code sample shows how to convert the collection of glyphs on a PDF page to a text string. The algorithm infers spaces and line breaks from the glyph positions, and it treats glyphs that overlap for visual effect as adjacent text without inserting a space.
Code sample to convert PDF to plain text
using (FileStream fileIn = new FileStream(@"..\..\..\inputdocuments/sometext.pdf", FileMode.Open, FileAccess.Read))
{
    Document document = new Document(fileIn);
    // Get the first page.
    Page page = document.Pages[0];
    // Retrieve all glyphs from the current page.
    // Keep a strong reference to the glyph collection; otherwise the GC may reclaim it.
    GlyphCollection glyphs = page.Glyphs;
    // By default, the glyphs are ordered as they appear in the PDF file.
    // We want them in reading order.
    glyphs.Sort();
    using (FileStream fileOut = new FileStream(@"..\..\extractedText.txt", FileMode.Create, FileAccess.Write))
    {
        StreamWriter writer = new StreamWriter(fileOut);
        Glyph previousGlyph = null;
        foreach (Glyph glyph in glyphs)
        {
            int spaces = CheckSpaces(previousGlyph, glyph);
            for (int i = 0; i < spaces; i++)
            {
                // Insert a space.
                writer.Write(" ");
            }
            if (spaces == -1)
            {
                // Insert a line break.
                writer.WriteLine();
            }
            // Write the characters of the glyph.
            foreach (char ch in glyph.Characters)
            {
                writer.Write(ch);
            }
            previousGlyph = glyph;
        }
        writer.Flush();
    }
}
Using fileIn As New FileStream("..\..\..\inputdocuments/sometext.pdf", FileMode.Open, FileAccess.Read)
    Dim document As New Document(fileIn)
    ' Get the first page.
    Dim page As Page = document.Pages(0)
    ' Retrieve all glyphs from the current page.
    ' Keep a strong reference to the glyph collection; otherwise the GC may reclaim it.
    Dim glyphs As GlyphCollection = page.Glyphs
    ' By default, the glyphs are ordered as they appear in the PDF file.
    ' We want them in reading order.
    glyphs.Sort()
    Using fileOut As New FileStream("..\..\extractedText.txt", FileMode.Create, FileAccess.Write)
        Dim writer As New StreamWriter(fileOut)
        Dim previousGlyph As Glyph = Nothing
        For Each glyph As Glyph In glyphs
            Dim spaces As Integer = CheckSpaces(previousGlyph, glyph)
            For i As Integer = 0 To spaces - 1
                ' Insert a space.
                writer.Write(" ")
            Next
            If spaces = -1 Then
                ' Insert a line break.
                writer.WriteLine()
            End If
            ' Write the characters of the glyph.
            For Each ch As Char In glyph.Characters
                writer.Write(ch)
            Next
            previousGlyph = glyph
        Next
        writer.Flush()
    End Using
End Using
// Some PDF files contain no space characters. Instead of a single string "word1 word2",
// the file holds two strings, "word1" and "word2", and the second is simply positioned
// further along the line to simulate a space.
// To account for this, this function compares the positions of two consecutive glyphs.
static int CheckSpaces(Glyph firstGlyph, Glyph secondGlyph)
{
    if (firstGlyph == null)
    {
        // There is no previous glyph to compare against.
        return 0;
    }
    if (firstGlyph.BottomLeft.Y != secondGlyph.BottomLeft.Y)
    {
        // The glyphs are not on the same line (the caller converts -1 to a line break).
        return -1;
    }
    double spaceBetween = secondGlyph.BottomLeft.X - firstGlyph.BottomRight.X;
    if (spaceBetween < 0.1)
    {
        // Touching or overlapping text: no space.
        return 0;
    }
    double spaceLength = firstGlyph.Font.CalculateWidth(" ", firstGlyph.FontSize);
    double spaces = spaceBetween / spaceLength;
    return (int)Math.Round(spaces);
}
' Some PDF files contain no space characters. Instead of a single string "word1 word2",
' the file holds two strings, "word1" and "word2", and the second is simply positioned
' further along the line to simulate a space.
' To account for this, this function compares the positions of two consecutive glyphs.
Private Function CheckSpaces(firstGlyph As Glyph, secondGlyph As Glyph) As Integer
    If firstGlyph Is Nothing Then
        ' There is no previous glyph to compare against.
        Return 0
    End If
    If firstGlyph.BottomLeft.Y <> secondGlyph.BottomLeft.Y Then
        ' The glyphs are not on the same line (the caller converts -1 to a line break).
        Return -1
    End If
    Dim spaceBetween As Double = secondGlyph.BottomLeft.X - firstGlyph.BottomRight.X
    If spaceBetween < 0.1 Then
        ' Touching or overlapping text: no space.
        Return 0
    End If
    Dim spaceLength As Double = firstGlyph.Font.CalculateWidth(" ", firstGlyph.FontSize)
    Dim spaces As Double = spaceBetween / spaceLength
    Return CInt(Math.Round(spaces))
End Function
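The spacing logic above can also be tried without a PDF file. The following is a minimal sketch, not part of the library: GlyphBox, SpacingSketch, CountSpaces, ToText and the lineTolerance parameter are hypothetical names introduced here for illustration. It differs from CheckSpaces in two ways that are assumptions rather than the sample's behavior: baselines are compared with a small tolerance instead of exact equality, and a non-positive space width is treated as "no space" to avoid dividing by zero. It requires C# 10 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a library glyph: only the values the spacing logic needs.
public readonly record struct GlyphBox(string Text, double Left, double Right, double Baseline, double SpaceWidth);

public static class SpacingSketch
{
    // Returns -1 for a line break, otherwise the number of spaces to insert before 'current'.
    public static int CountSpaces(GlyphBox? previous, GlyphBox current, double lineTolerance = 0.5)
    {
        if (previous is null) return 0; // first glyph: nothing to compare against

        GlyphBox prev = previous.Value;

        // Assumption: baselines within the tolerance count as the same line.
        if (Math.Abs(prev.Baseline - current.Baseline) > lineTolerance) return -1;

        double gap = current.Left - prev.Right;

        // Touching or overlapping glyphs, or an unusable space width: no space.
        if (gap < 0.1 || prev.SpaceWidth <= 0) return 0;

        return (int)Math.Round(gap / prev.SpaceWidth);
    }

    // Builds the text for glyphs that are already sorted in reading order.
    public static string ToText(IEnumerable<GlyphBox> glyphsInReadingOrder)
    {
        var text = new StringBuilder();
        GlyphBox? previous = null;
        foreach (GlyphBox glyph in glyphsInReadingOrder)
        {
            int spaces = CountSpaces(previous, glyph);
            if (spaces == -1) text.Append('\n');
            else text.Append(' ', spaces);
            text.Append(glyph.Text);
            previous = glyph;
        }
        return text.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // Two words on one line separated by a gap of one space width, then a word on the next line.
        var glyphs = new List<GlyphBox>
        {
            new("word1", 10, 40, 700, 3),
            new("word2", 43, 73, 700, 3),
            new("next", 10, 30, 686, 3),
        };
        Console.WriteLine(SpacingSketch.ToText(glyphs));
    }
}
```

With the sample data, the first two entries end up on one line separated by a single space and the third starts a new line. The tolerance of 0.5 is an arbitrary illustrative value; a real document may need a value derived from the font size.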