
Bootstrap version: Adding more flexibility to the DSL vocabulary #26

Open · PaulGwamanda opened this issue May 30, 2018 · 11 comments

@PaulGwamanda

The DSL is simple and straightforward: a token per matching HTML snippet. However, in a real-world scenario many classes would overlap and intermix with each other, e.g.:

<div class="col-md-3">{}</div>

could be:

<div class="col-md-3 bg-primary">{}</div>

or

<div class="col-md-3 border border-primary">{}</div>

It seems the structure of the DSL requires a very large vocabulary with thousands of tokens, but even that would not solve the flexibility problem. How would you approach solving this?

Is an Emmet-style vocabulary like the one below possible?

{
"quadruple.border+border-primary": "<div class=\"col-lg-3 border border-primary \">\n{}\n</div>\n"
}

@emilwallner (Owner)

Great insight, Paul. I'd say it's one of the most important research areas for advancing this field. I'd love to see a paper or blog article exploring this topic in depth, outlining the key constraints for the generator and the key features for the tech, e.g. scalability, modularity, and capabilities. I haven't thought this through enough to add value to the discussion.

@PaulGwamanda (Author)

I've written a compiler that solves this (I'll paste the code below). Basically, the compiler takes tokens from the GUI file and appends them to the web-dsl-mapping.json file with a friendly naming convention.

So that a gui of:

quadruple.border+border-primary

becomes:

{"quadruple__border_border-primary": "<div class=\"col-lg-3 border border-primary\">\n{}\n</div>"}

The friendly DSL naming convention takes the character "." (which denotes classes) and turns it into "__", and takes "+" (which denotes additional classes) and turns it into "_".
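
For example, a minimal sketch of that renaming rule (the helper name is just for illustration):

def to_dsl_key(token):
    # "." separates the base token from its classes -> "__"
    # "+" separates additional classes -> "_"
    return token.replace('.', '__').replace('+', '_')

print(to_dsl_key("quadruple.border+border-primary"))
# quadruple__border_border-primary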

This provides much more flexibility in the DSL mappings. I have not tested it on FloydHub yet, but I will in a couple of days and open a pull request:

import shutil
import glob
import json
import re

dsl_file_path = "assets/web-dsl-mapping.json"
special_tokens = 'data/train/special_tokens.txt'

# Find all '.gui' files in the data folder and combine them into special_tokens.txt
with open(special_tokens, 'wb') as outfile:
    for gui_tokens in glob.glob('*/*/*.gui'):
        if gui_tokens == special_tokens:
            # don't copy the output file into itself
            continue
        with open(gui_tokens, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)

# Collect all tokens that use the class separator ('.')
with open(special_tokens, 'r') as filename:
    # split each line on ',' and strip the opening braces
    tokens = [x.replace('{', '') for line in filename for x in line.split(',')]
    # friendly DSL naming convention: "__" replaces "." and "_" replaces "+"
    tokens = [x.strip().replace('.', '__').replace('+', '_') for x in tokens if '.' in x]
    # turn the tokens into dictionary keys with empty values (this also deduplicates)
    tokens = dict.fromkeys(tokens, '')

# Load the web-dsl-mapping.json file
with open(dsl_file_path, 'r') as f:
    dsl_tokens = json.load(f)

new_tokens = sorted(tokens)
# "quadruple__border_border-primary" -> "quadruple__border border-primary"
renamed_keys = [x.replace('__', '+').replace('_', ' ').replace('+', '__') for x in new_tokens]

# map each base token ("quadruple") to its extra classes ("border border-primary")
split_tokens = dict(s.split('__', 1) for s in renamed_keys)
# keep only the base tokens that already have a snippet in the DSL mapping
copy_of_dsl_filtered = {key: dsl_tokens[key].strip() for key in split_tokens if key in dsl_tokens}

# insert the extra classes into the existing class="..." attribute of each snippet
tokens_with_classes_inserted = {
    a: re.sub(r'(?<=class=")[\w\s\-]+(?=")', lambda m: m.group() + ' ' + split_tokens[a], b)
    for a, b in sorted(copy_of_dsl_filtered.items())
}

# map each base token back to its full underscored key, e.g.
# "quadruple" -> "quadruple__border_border-primary"
keymap = {key.partition('_')[0]: key for key in new_tokens}

# apply the keymap so the new entries use the friendly underscored names
updated_tokens_with_classes_inserted = {
    keymap.get(key, key): value for key, value in tokens_with_classes_inserted.items()
}

# merge the original DSL tokens back in (existing entries take precedence)
updated_tokens_with_classes_inserted.update(dsl_tokens)

# write the updated mapping back to web-dsl-mapping.json
with open(dsl_file_path, 'w') as f:
    json.dump(updated_tokens_with_classes_inserted, f, indent=2, sort_keys=True)
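
As a standalone illustration (not part of the compiler above), here is what the class-insertion regex does to a single snippet:

import re

snippet = '<div class="col-lg-3">\n{}\n</div>'
extra_classes = 'border border-primary'

print(re.sub(r'(?<=class=")[\w\s\-]+(?=")',
             lambda m: m.group() + ' ' + extra_classes,
             snippet))
# <div class="col-lg-3 border border-primary">
# {}
# </div>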

@emilwallner (Owner)

Excellent, I'm looking forward to seeing how it performs!

@PaulGwamanda (Author)

So I've been testing this on my larger vocab of 270 tokens using an updated compiler, and it seems the network doesn't perform all that well, bummer.

I suspect it has to do with the one-hot encoding, which, as the paper says, does not scale very well to a large vocabulary and thus restricts the number of tokens in the DSL.

I'll also adjust the sliding-window size (T = 48) and analyse the different outcomes.
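
For context, a minimal sketch of why one-hot inputs scale poorly here (my illustration, not the project's code): with one-hot encoding, the T = 48 window grows linearly with the vocabulary, so a 270-token vocab makes each window roughly 13x wider than a ~20-token DSL.

import numpy as np

T = 48            # sliding-window length
vocab_size = 270  # the extended vocabulary discussed above

def one_hot_window(token_ids, vocab_size):
    # each token becomes a vocab_size-wide one-hot row,
    # so one window is a T x vocab_size matrix
    window = np.zeros((len(token_ids), vocab_size))
    window[np.arange(len(token_ids)), token_ids] = 1.0
    return window

window = one_hot_window(np.random.randint(0, vocab_size, size=T), vocab_size)
print(window.shape)  # (48, 270), versus (48, ~20) for a small DSL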

@emilwallner (Owner)

Interesting find, Paul. What's the BLEU score (4-gram, greedy search)?

Here are some ideas on the top of my mind:

  • Start with a small vocab, and then gradually add more tokens, e.g. by masking the output and redistributing the prediction over the available tokens.
  • Pre-train the LSTM before you train it end-to-end.
  • Create one channel for each DIV and add relational reasoning.
  • Apply attention.

Keep us in the loop!
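
For reference, the 4-gram BLEU score mentioned above can be computed with NLTK; a minimal sketch, with made-up DSL token sequences:

from nltk.translate.bleu_score import sentence_bleu

# hypothetical reference (ground-truth) and candidate (predicted) DSL sequences
reference = 'header { btn-active btn-inactive } row { quadruple { small-title text btn-green } }'.split()
candidate = 'header { btn-active } row { quadruple { small-title text btn-green } }'.split()

# equal weights over 1- to 4-grams: the standard BLEU-4 setup
score = sentence_bleu([reference], candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(round(score, 3))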

@PaulGwamanda (Author)

So I'll be using MTurk to get a much larger dataset. My model performs reasonably well on my custom vocab with a couple hundred real-world web screenshots. My BLEU score was quite low, but the markup was coherent; clumsy, but coherent. It needed more epochs, I think. I haven't tweaked the network much, just a few parameters here and there, but I know more data is key. With MTurk I should have a couple thousand GUIs and be able to really test on real-world scenarios.

@emilwallner (Owner)

@PaulGwamanda Awesome, let me know how it goes. I'm curious why you didn't further develop a generator with an extended DSL and screenshots, or scrape existing websites and clean them?

@PaulGwamanda (Author)

The generator solves the markup problem (grids, buttons, cols, etc.), but it doesn't cover real-world features like more complex layouts, fonts, colors, OCR, animations, etc. Using Bootstrap's documentation and web templates from around the web, we can steadily extend the DSL to include more complex tokens. What I found is that once the DSL is mapped out, labelling a GUI takes roughly 15-20 minutes, at around 120 tokens on average. This example here takes 15 minutes if you're just labelling the layout markup features (grids, cols, buttons). I want to solve the layout problem first, then, as you suggest, move on to the more complex problems. Once I have a few thousand GUIs like this, I'll use GANs to further multiply the dataset.

@PaulGwamanda (Author)

I'll be pushing my fork this week here: https://github.com/PaulGwamanda/Pix2code-Screenshot-to-code-dataset-builder. It will include a custom dataset (100 images), a dataset builder, training scripts, a Flask API, and a complete DSL library based on Bootstrap V4.

My current dataset is 2,500 images (over 286,000 training samples); if anyone is interested in the dataset, email me at [email protected] for a reasonable price :)

@Mouradif commented Sep 9, 2021

> My current dataset is 2,500 images (over 286,000 training samples); if anyone is interested in the dataset, email me at [email protected] for a reasonable price :)

@PaulGwamanda sending you an email now
