Styling in Word versions of documents #24

Open
ghost opened this issue Oct 30, 2014 · 14 comments
Comments

@ghost

ghost commented Oct 30, 2014

I noticed that the Microsoft Word versions of these documents have room for improvement with respect to styling. Is it possible to crowdsource the Word files themselves, or are we only doing substantive text? Thanks!

@kemitchell

The trouble with tracking .docx is that Git will treat them as binary files. Neither Git nor GitHub can diff or auto-merge Word files.

As for styling, there's wide variation in how lawyers use (and don't use) Word styles, cross-references, automatic numbering, etc. I've done a little work on programmatically outputting .docx contracts using the whole kitchen sink of Word features (https://github.com/CommonForm/commonform-docx), but it's on my to-do list to go back to using .docx like RTF, without styles or fields.

If you're looking for a way to do Markdown to docx today, have a look at Pandoc.

@ghost
Author

ghost commented Oct 30, 2014

Thanks for the speedy reply Kyle. .docx files are zipped collections of XML text files. In theory, the maintainer could "compile" the XML source code by zipping them and renaming the zip file for publishing in .docx form. But if the project is interested in the substance of the words only, I get that... was just curious.
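
A minimal sketch of that unzip/"compile" round trip, assuming Python's standard `zipfile` module. The function names are hypothetical and not part of this project, and writing `[Content_Types].xml` first is an assumption about what Word-compatible packages usually look like rather than a tested requirement.

```python
import zipfile
from pathlib import Path


def unpack_docx(docx_path, out_dir):
    """Extract the XML parts of a .docx (a ZIP archive) into out_dir."""
    with zipfile.ZipFile(docx_path) as archive:
        archive.extractall(out_dir)


def pack_docx(src_dir, docx_path):
    """Zip a directory of XML parts back into a single .docx file."""
    src = Path(src_dir)
    files = sorted(p for p in src.rglob("*") if p.is_file())
    # Assumption: keep [Content_Types].xml as the first archive entry.
    # The sort is stable, so the remaining parts stay in alphabetical order.
    files.sort(key=lambda p: p.name != "[Content_Types].xml")
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Archive member names always use forward slashes.
            archive.write(path, path.relative_to(src).as_posix())
```

With this, the unpacked XML directory would be what lives in Git, and `pack_docx` would run only when publishing.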

@kemitchell

Unfortunately, diff'ing document.xml files isn't much of an improvement on binary diff, given the amount of cruft. Word does strange things with ranges and paragraphs in those files, up to and including encoding equivalent-looking text in different ways.
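
To illustrate the point about equivalent-looking text, here is a small sketch using Python's standard `xml.etree.ElementTree`. The two sample documents are made up for illustration: they render the same words, but one splits the paragraph into two runs. Extracting only the visible text per paragraph makes them compare equal, whereas a line diff of the raw XML would not. It ignores tabs, line breaks, fields, and nested content such as text boxes, so treat it as a rough approach rather than a real comparison tool.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical samples: same visible text, different run structure.
ONE_RUN = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    "<w:r><w:t>Defined Term</w:t></w:r>"
    "</w:p></w:body></w:document>"
)
TWO_RUNS = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">Defined </w:t></w:r>'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>Term</w:t></w:r>"
    "</w:p></w:body></w:document>"
)


def paragraph_texts(document_xml):
    """Return the visible text of each paragraph, ignoring run boundaries."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        # Concatenate every text node (w:t) inside the paragraph.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return paragraphs
```

Both `paragraph_texts(ONE_RUN)` and `paragraph_texts(TWO_RUNS)` yield a single paragraph reading "Defined Term", even though the underlying XML differs.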

If you're interested in working on a programmatic way to output pretty .docx from plain text or structured data contract descriptions, let's definitely chat. It's on my list for another project, where I'm trying to handle 100% in-browser.

@jrmiller82

You guys should really switch over to something like Markdown, LaTeX, or Org mode in plain text, so the diffs stay super easy but you can still output very pretty documents.

@kemitchell

The files are currently in Markdown. A number of packages, including Pandoc, can convert to LaTeX and other formats.

@jrmiller82

Oops. My bad. Sorry.

@ghost
Author

ghost commented Oct 31, 2014

@kemitchell I skimmed through the Markdown syntax. It does not appear to support inline semantic markup (e.g., marking defined terms for styling). Am I understanding this right, or is there some way in Markdown to duplicate something like HTML's span element approach?

@kemitchell

@joejarvis Markdown is hard to generalize. Despite some efforts in the direction of standardization (http://commonmark.org/), implementations vary widely. Almost all support inline HTML, which would get you <span>, but at the price of turning the whole containing paragraph into an HTML literal where Markdown-style underscores and asterisks no longer apply. I'm sure there are some "extended" Markdown flavors with inline element styles, but these will be idiosyncratic, and probably lock you to a particular implementation.

@jrmiller82

I prefer org or latex personally. More precise formatting choices.


@ghost
Author

ghost commented Oct 31, 2014

Thanks @kemitchell. Looking over the inline section of the commonmark spec, I can see that the only "official" option for inline semantic markup is reverting to HTML, which would defeat the purpose of using Markdown. Bummer. I'll take a look at Pandoc when I have time.

@goodcounsel

Putting aside the technical issues of doing .docx in GitHub, which I know nothing about, it is clear that the styling of the final documents is terribly bloated and inconsistent. The documents should simply be copied and pasted into clean Word docs, and the styling simplified. There's no reason to have nearly 50 different styles in this document, which really just has a few levels of numbered headings, and assorted others. And for all of these styles, the Article heading (Roman I, II, III, etc.) in the Certificate and the second level after (A, B, C, etc.) are not even auto-numbered! Maybe no one thinks anyone is going to edit these and it's therefore not necessary, but I (and I am sure others) do sometimes modify the base provisions.

@kemitchell

@goodcounsel, alas, neither GitHub nor the Git tool that underlies it is well suited to .docx comparison, in part because even .docx files that seem simple in Word are nightmarish data junk-drawers "under the hood." Auto-numbering, in particular, is black magic, as evidenced by the problems even very expensive Word document comparison tools often have with it. Markdown, on the other hand, is very well supported on GitHub, and makes reviewing and editing via web browser about as straightforward as it can be.

When you mention fifty Word styles, are you referring to the .docx files from seriesseed.com in Microsoft Word? I see only standard styles in the version 3.2 clean copies in most-recent Word on Windows 7.

@jboehmig, are the .docx files on seriesseed.com generated from the .md automatically? If not, I can make a to-do item to PR a build system using pandoc, which does sane Markdown-to-Word conversion, and Travis CI to do the conversion and build a GitHub "release" of each new tagged commit automatically.
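
A rough sketch of what the conversion step in such a build might look like, assuming `pandoc` is installed and on the PATH. The function names, file names, and output directory are hypothetical. `--reference-doc` is the flag current pandoc releases use to borrow styles from a template .docx; older releases used a different flag name, so check the installed version.

```python
import subprocess
from pathlib import Path


def pandoc_command(md_path, out_dir, reference_doc=None):
    """Build the pandoc argument list that converts one Markdown file to .docx."""
    src = Path(md_path)
    dest = Path(out_dir) / (src.stem + ".docx")
    cmd = ["pandoc", str(src), "-o", str(dest)]
    if reference_doc is not None:
        # Optional: take paragraph and heading styles from a clean template.
        cmd.append(f"--reference-doc={reference_doc}")
    return cmd


def build_all(md_dir, out_dir, reference_doc=None):
    """Convert every .md file in md_dir; raises if pandoc exits with an error."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for md_path in sorted(Path(md_dir).glob("*.md")):
        subprocess.run(pandoc_command(md_path, out_dir, reference_doc), check=True)
```

A CI job could call something like `build_all(".", "dist")` on each tagged commit and attach the resulting files to a release.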

@goodcounsel

Yes, right from the website. Here are some screenshots showing "styles in use" from the Word Style Organizer. There is some crazy explosion of styles.

[Five screenshots, 2015-05-01: "styles in use" as listed in the Word Style Organizer]

@kemitchell

@goodcounsel: I take it those styles are the calling card of Fenwick's house Word template or numbering plug-in. I will follow up with @jboehmig about an automated process for generating clean .docx.
