Extracting text elements with bounding box #374

Answered by topcat30
topcat30 asked this question in Q&A
Hi, this gets me most of the way. I usually set GAP to 0.3 and OrigRow to true. GAP basically controls how much space is allowed between characters, so you can tune it to the font size. Things get a bit messy if the font size changes a lot.
Let me know if you find a better way.

Oh, and you can switch the output to either spool or lines if you want an XML dump of the text data.

using System;
using System.Text;
using System.IO;
using Org.BouncyCastle.Cms;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;
using System.Linq;
using System.Collections;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;
namespace PDFTools
{
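
For readers who just want the GAP idea from the post in isolation, here is a minimal sketch of one way it could work. It is not the code from this answer: the `Glyph` and `TextElement` records and the `GapGrouper` class are hypothetical stand-ins (not PdfPig types), and it assumes GAP is applied as a fraction of the font size, which the post does not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one positioned character (not a PdfPig type).
public record Glyph(string Text, double Left, double Right,
                    double Bottom, double Top, double FontSize);

// Hypothetical result: the joined text plus the union of the glyph boxes.
public record TextElement(string Text, double Left, double Bottom,
                          double Right, double Top);

public static class GapGrouper
{
    // Groups the glyphs of a single text row into elements.
    // Assumption: a new element starts when the horizontal distance to the
    // previous glyph exceeds gap * (previous glyph's font size).
    public static List<TextElement> Group(IEnumerable<Glyph> glyphs, double gap = 0.3)
    {
        var result = new List<TextElement>();
        var current = new List<Glyph>();

        foreach (var g in glyphs.OrderBy(x => x.Left))
        {
            if (current.Count > 0)
            {
                var prev = current[^1];
                if (g.Left - prev.Right > gap * prev.FontSize)
                {
                    // Gap too wide: close the current element and start a new one.
                    result.Add(Merge(current));
                    current = new List<Glyph>();
                }
            }
            current.Add(g);
        }

        if (current.Count > 0)
        {
            result.Add(Merge(current));
        }
        return result;
    }

    // Concatenates the text and takes the smallest box containing every glyph.
    private static TextElement Merge(List<Glyph> run) => new(
        string.Concat(run.Select(g => g.Text)),
        run.Min(g => g.Left), run.Min(g => g.Bottom),
        run.Max(g => g.Right), run.Max(g => g.Top));
}
```

Calling `GapGrouper.Group(glyphs, 0.3)` on the characters of one row returns one `TextElement` per run of closely spaced characters. Because the threshold scales with the previous glyph's font size, rows that mix very different sizes can split or merge unexpectedly, which matches the caveat in the post.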

Replies: 4 comments 3 replies

Answer selected by topcat30