
Bootstrap version: Adding more flexibility to the DSL vocabulary #26

Open · PaulGwamanda opened this issue May 30, 2018 · 11 comments

@PaulGwamanda

The DSL is simple and straightforward: a token per matching HTML snippet. However, in a real-world scenario many classes would overlap and intermix with each other, e.g.:

<div class="col-md-3">{}</div>

could be:

<div class="col-md-3 bg-primary">{}</div>

or

<div class="col-md-3 border border-primary">{}</div>

It seems the structure of the DSL requires a very large vocabulary with thousands of tokens, but even that would not solve the flexibility problem. How would you approach solving this?

Is an Emmet-style vocabulary like the one below possible?

{
"quadruple.border+border-primary": "<div class=\"col-lg-3 border border-primary \">\n{}\n</div>\n"
}

@emilwallner (Owner)

Great insight, Paul. I'd say it's one of the most important research areas for advancing this field. I'd love to see a paper or blog article exploring this topic in depth, outlining the key constraints for the generator and the key features for the tech, e.g. scalability, modularity, and capabilities. I haven't thought this through enough to add value to the discussion.

@PaulGwamanda (Author)

I've written a compiler that solves this (I'll paste the code below). Basically, the compiler takes tokens from the GUI file and appends them to the web-dsl-mapping.json file with a friendly naming convention.

So that a gui of:

quadruple.border+border-primary

becomes:

{"quadruple__border_border-primary": "<div class=\"col-lg-3 border border-primary\">\n{}\n</div>"}

The friendly DSL naming convention takes the character "." (which denotes classes) and turns it into "__", and takes "+" (which denotes additional classes) and turns it into "_".
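
For example, a minimal sketch of that renaming rule (the helper name is just for illustration):

def to_dsl_key(token):
    # "." separates the base token from its classes -> "__"
    # "+" separates additional classes -> "_"
    return token.replace('.', '__').replace('+', '_')

print(to_dsl_key("quadruple.border+border-primary"))
# quadruple__border_border-primary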

This provides much more flexibility in the DSL mappings. I have not tested it on FloydHub yet, but I will in a couple of days and open a pull request:

import shutil
import glob
import json
import re

dsl_file_path = "assets/web-dsl-mapping.json"
special_tokens = 'data/train/special_tokens.txt'

# Find all '.gui' files in the data folder and combine them into special_tokens.txt
with open(special_tokens, 'wb') as outfile:
    for gui_tokens in glob.glob('*/*/*.gui'):
        if gui_tokens == special_tokens:
            # don't copy the output file into itself
            continue
        with open(gui_tokens, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)

# Collect all tokens that use the class separator ('.')
with open(special_tokens, 'r') as filename:
    # split each line on ',' and strip the opening braces
    tokens = [x.replace('{', '') for line in filename for x in line.split(',')]
    # friendly DSL naming convention: "__" replaces "." and "_" replaces "+"
    tokens = [x.strip().replace('.', '__').replace('+', '_') for x in tokens if '.' in x]
    # turn the tokens into dictionary keys with empty values (this also deduplicates)
    tokens = dict.fromkeys(tokens, '')

# Load the web-dsl-mapping.json file
with open(dsl_file_path, 'r') as f:
    dsl_tokens = json.load(f)

new_tokens = sorted(tokens)
# "quadruple__border_border-primary" -> "quadruple__border border-primary"
renamed_keys = [x.replace('__', '+').replace('_', ' ').replace('+', '__') for x in new_tokens]

# map each base token ("quadruple") to its extra classes ("border border-primary")
split_tokens = dict(s.split('__', 1) for s in renamed_keys)
# keep only the base tokens that already have a snippet in the DSL mapping
copy_of_dsl_filtered = {key: dsl_tokens[key].strip() for key in split_tokens if key in dsl_tokens}

# insert the extra classes into the existing class="..." attribute of each snippet
tokens_with_classes_inserted = {
    a: re.sub(r'(?<=class=")[\w\s\-]+(?=")', lambda m: m.group() + ' ' + split_tokens[a], b)
    for a, b in sorted(copy_of_dsl_filtered.items())
}

# map each base token back to its full underscored key, e.g.
# "quadruple" -> "quadruple__border_border-primary"
keymap = {key.partition('_')[0]: key for key in new_tokens}

# apply the keymap so the new entries use the friendly underscored names
updated_tokens_with_classes_inserted = {
    keymap.get(key, key): value for key, value in tokens_with_classes_inserted.items()
}

# merge the original DSL tokens back in (existing entries take precedence)
updated_tokens_with_classes_inserted.update(dsl_tokens)

# write the updated mapping back to web-dsl-mapping.json
with open(dsl_file_path, 'w') as f:
    json.dump(updated_tokens_with_classes_inserted, f, indent=2, sort_keys=True)
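
As a standalone illustration (not part of the compiler above), here is what the class-insertion regex does to a single snippet:

import re

snippet = '<div class="col-lg-3">\n{}\n</div>'
extra_classes = 'border border-primary'

print(re.sub(r'(?<=class=")[\w\s\-]+(?=")',
             lambda m: m.group() + ' ' + extra_classes,
             snippet))
# <div class="col-lg-3 border border-primary">
# {}
# </div>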

@emilwallner (Owner)

Excellent, I'm looking forward to seeing how it performs!

@PaulGwamanda (Author)

So I've been testing this on my larger vocab of 270 tokens using an updated compiler, and it seems the network doesn't perform all that well, bummer.

I suspect it has to do with the one-hot encoding, which, as the paper says, does not scale very well to a large vocabulary and thus restricts the number of tokens in the DSL.

I'll also adjust the sliding-window size (T = 48) and analyse the different outcomes.
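
For context, a minimal sketch of why one-hot inputs scale poorly here (my illustration, not the project's code): with one-hot encoding, the T = 48 window grows linearly with the vocabulary, so a 270-token vocab makes each window roughly 13x wider than a ~20-token DSL.

import numpy as np

T = 48            # sliding-window length
vocab_size = 270  # the extended vocabulary discussed above

def one_hot_window(token_ids, vocab_size):
    # each token becomes a vocab_size-wide one-hot row,
    # so one window is a T x vocab_size matrix
    window = np.zeros((len(token_ids), vocab_size))
    window[np.arange(len(token_ids)), token_ids] = 1.0
    return window

window = one_hot_window(np.random.randint(0, vocab_size, size=T), vocab_size)
print(window.shape)  # (48, 270), versus (48, ~20) for a small DSL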

@emilwallner (Owner)

Interesting find, Paul. What's the BLEU score (4-gram, greedy search)?

Here are some ideas on the top of my mind:

  • Start with a small vocab, and then gradually add more tokens, e.g. by masking the output and redistributing the prediction over the available tokens.
  • Pre-train the LSTM before you train it end-to-end.
  • Create one channel for each DIV and add relational reasoning.
  • Apply attention.

Keep us in the loop!
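
For reference, the 4-gram BLEU score mentioned above can be computed with NLTK; a minimal sketch, with made-up DSL token sequences:

from nltk.translate.bleu_score import sentence_bleu

# hypothetical reference (ground-truth) and candidate (predicted) DSL sequences
reference = 'header { btn-active btn-inactive } row { quadruple { small-title text btn-green } }'.split()
candidate = 'header { btn-active } row { quadruple { small-title text btn-green } }'.split()

# equal weights over 1- to 4-grams: the standard BLEU-4 setup
score = sentence_bleu([reference], candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(round(score, 3))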

@PaulGwamanda (Author)

So I'll be using MTurk to get a much larger dataset. My model performs reasonably well on my custom vocab with a couple hundred real-world web screenshots. My BLEU score was quite low, but the markup was coherent; clumsy, but coherent. It needed more epochs, I think. I haven't tweaked the network much, just a few parameters here and there, but I know more data is key. With MTurk I should have a couple thousand GUIs and be able to really test on real-world scenarios.

@emilwallner (Owner)

@PaulGwamanda Awesome, let me know how it goes. I'm curious why you didn't further develop a generator with an extended DSL and screenshots, or scrape existing websites and clean them?

@PaulGwamanda (Author)

The generator solves the markup problem (grids, buttons, cols, etc.), but it doesn't cover real-world features like more complex layouts, fonts, colors, OCR, animations, etc. Using Bootstrap's documentation and web templates from around the web, we can steadily extend the DSL to include more complex tokens. What I found is that once the DSL is mapped out, labelling a GUI takes roughly 15-20 minutes, at around 120 tokens on average. This example here takes 15 minutes if you're just labelling the layout markup features (grids, cols, buttons). I want to solve the layout problem first, then, as you suggest, move on to the more complex problems. Once I have a few thousand GUIs like this, I'll use GANs to further multiply the dataset.

@PaulGwamanda (Author)

I'll be pushing my fork this week here: https://github.com/PaulGwamanda/Pix2code-Screenshot-to-code-dataset-builder. It will include a custom dataset (100 images), a dataset builder, training scripts, a Flask API, and a complete DSL library based on Bootstrap V4.

My current dataset is 2,500 images (over 286,000 training samples); if anyone is interested in the dataset, email me at [email protected] for a reasonable price :)

@Mouradif commented Sep 9, 2021

> My current dataset is 2,500 images (over 286,000 training samples); if anyone is interested in the dataset, email me at [email protected] for a reasonable price :)

@PaulGwamanda sending you an email now
