Break PDFs into chunks bigger than 1 page? #128

Open
AbeHandler opened this issue Feb 24, 2015 · 3 comments

Comments

@AbeHandler
Contributor

I just got a very large PDF. I want to break it into smaller PDFs, but not thousands and thousands of them. Would you be open to a pull request that adds this feature to the pages command? Something like:

$ docsplit pages big.pdf --pages 1-1000 --numoutput 1  # breaks the first 1000 pages into a single file

page_extractor.rb

# Burst a list of pdfs into single pages, as `pdfname_pagenumber.pdf`.
def extract(pdfs, opts)
  extract_options opts
  [pdfs].flatten.each do |pdf|
    pdf_name = File.basename(pdf, File.extname(pdf))
    page_path = ESCAPE[File.join(@output, "#{pdf_name}")] + "_%d.pdf"
    FileUtils.mkdir_p @output unless File.exist?(@output)

    cmd = if DEPENDENCIES[:pdftailor] # prefer pdftailor, but keep pdftk for backwards compatibility
      "pdftailor unstitch --output #{page_path} #{ESCAPE[pdf]} 2>&1"
    else
      "pdftk #{ESCAPE[pdf]} burst output #{page_path} 2>&1"
    end
    result = `#{cmd}`.chomp
    FileUtils.rm('doc_data.txt') if File.exist?('doc_data.txt')
    raise ExtractionFailed, result if $? != 0
    result
  end
end
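
A minimal sketch of how one contiguous page range could be written to a single output file, assuming pdftk is the backend. It relies on pdftk's `cat` operation, which accepts ranges such as `1-1000`. The helper name `range_command` and its arguments are hypothetical and not part of Docsplit, and nothing here assumes pdftailor has an equivalent.

```ruby
require 'shellwords'

# Hypothetical helper (not part of Docsplit): build a pdftk command that
# copies one contiguous page range of `pdf` into a single output file.
# Relies on pdftk's `cat` operation, e.g.
#   pdftk in.pdf cat 1-1000 output out.pdf
def range_command(pdf, first_page, last_page, output_dir)
  unless first_page.is_a?(Integer) && last_page.is_a?(Integer) &&
         first_page >= 1 && last_page >= first_page
    raise ArgumentError, "invalid page range: #{first_page}-#{last_page}"
  end

  pdf_name = File.basename(pdf, File.extname(pdf))
  out_path = File.join(output_dir, "#{pdf_name}_#{first_page}-#{last_page}.pdf")

  [
    'pdftk', Shellwords.escape(pdf),
    'cat', "#{first_page}-#{last_page}",
    'output', Shellwords.escape(out_path),
    '2>&1'
  ].join(' ')
end

puts range_command('big.pdf', 1, 1000, 'out')
```

The returned string could be run the same way `extract` runs its burst command, with backticks and a check of `$?`. Naming the output after the range is only one possible convention.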

AbeHandler changed the title from "Break PDFs into" to "Break PDFs into chunks bigger than 1 page?" on Feb 24, 2015
@AbeHandler
Contributor Author

@knowtheory
Member

Hey @AbeHandler,

Yep, you're right. Adding page ranges to Docsplit will also require adding them to PDFtailor, since PDFtailor just splits a PDF into all of its constituent pages. If you are interested in adding page ranges to PDFtailor, a pull request would be more than welcome!

That said, I'm not so keen on --numoutput. My feeling is that if you want pages, you should use the pages subcommand; if you want PDFs, we should be talking about the pdf command.

I'm more comfortable with something like docsplit pdf source.pdf --pages 1-5 10-20 30-37, or maybe even docsplit pdf source.pdf --split 1-5 10-20 30-37.
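
For illustration, range arguments such as `1-5 10-20 30-37` could be turned into Ruby ranges along these lines. This is only a sketch: `parse_page_ranges` is a hypothetical name, and neither `--pages` with multiple ranges nor `--split` is assumed to exist on the `pdf` command.

```ruby
# Hypothetical parser for arguments such as `1-5 10-20 30-37`.
# Returns an array of Ruby Ranges; raises on malformed or reversed specs.
def parse_page_ranges(specs)
  specs.map do |spec|
    match = /\A(\d+)-(\d+)\z/.match(spec)
    raise ArgumentError, "malformed page range: #{spec.inspect}" unless match

    first = match[1].to_i
    last  = match[2].to_i
    raise ArgumentError, "invalid page range: #{spec}" if first < 1 || last < first

    (first..last)
  end
end

ranges = parse_page_ranges(%w[1-5 10-20 30-37])
ranges.each { |r| puts "#{r.begin}-#{r.end} (#{r.size} pages)" }
```

Each range would then map to one output PDF, which keeps the "one file per range" behaviour explicit without needing a separate output count.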

@pickhardt

Hi, just checking whether page ranges have been added. I'd like to be able to call Docsplit.extract_text(filepath, start_page: 20, end_page: 25), for example.
