creating dataset #38

Open
Kotresh17 opened this issue May 20, 2019 · 4 comments

Comments

@Kotresh17

Hi, many thanks for sharing the data and code. How can we take this forward and generate more data beyond the synthesised set? Can we create the same kind of dataset for real HTML pages? If so, how can we generate the .gui files for them? If you have any resources or thoughts, please share them with us.

@PaulGwamanda

PaulGwamanda commented May 27, 2019

Hi Kotresh,

For the Bootstrap version, you could write a script that takes screenshots of existing Bootstrap website templates and build a DSL vocabulary based on them. It should be pretty straightforward, with the structure looking like the pix2code datasets and DSL.

So for example a website that looks like this: https://imgur.com/a/IF3NxTV

Would have a .gui that looks something like below:

header{
    navigation-top{
        logo,
        menu-right{
            menu-link-active,
            menu-link,
            menu-link,
            menu-link
        }
    }
}
main-heading,
row{
    col-3{
        link{
            image
        }
    }
    col-3{
        link{
            image
        }
    }
    col-3{
        link{
            image
        }
    }
}
footer{
    row-centered{
        text
    }
}
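
When writing .gui files like this by hand, a quick sanity check on the braces can save a failed training run. Below is a minimal sketch, assuming the DSL consists only of element names, `{`, `}` and `,` as in the example above. The helper names `tokenize_gui` and `check_braces` are made up for illustration and are not part of the project.

```python
import re

# Assumption: a token is a brace, a comma, or an element name
# made of letters, digits, underscores and hyphens.
TOKEN_RE = re.compile(r"[{},]|[A-Za-z0-9_-]+")


def tokenize_gui(text):
    """Split .gui DSL text into element names and punctuation tokens."""
    return TOKEN_RE.findall(text)


def check_braces(tokens):
    """Return True if every '{' is closed by a matching '}'."""
    depth = 0
    for tok in tokens:
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
            if depth < 0:  # a '}' appeared before its '{'
                return False
    return depth == 0


sample = """
header{
    navigation-top{
        logo,
        menu-link
    }
}
footer{
    row-centered{
        text
    }
}
"""

tokens = tokenize_gui(sample)
print(tokens[:4])
print("balanced:", check_braces(tokens))
```

The same token list can serve as the starting point for building the vocabulary, since each distinct element name is one DSL token.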

For the HTML version, quoting the issue from Emil:

#20

“As mentioned in the article, the HTML version does not generalize on new images. The Bootstrap version generalizes on new images but with a capped vocabulary. The evaluation images for the bootstrap version are under /data/eval/ . You can test it here: floydhub/Bootstrap/test_model_accuracy.ipynb

If you want to train it to generalize on a more advanced vocabulary, I'd recommend customizing it to work on the HTML set provided here: https://github.com/harvardnlp/im2markup (on floydhub: --data emilwallner/datasets/100k-html:data)

After that, I'd recommend creating a new dataset. Create a script that generates random websites, say starting with newsletters or blog layouts. Then you can add optical character recognition, fonts, colors and div sizes as you go.

If you build a version for the harvardnlp dataset or a script that generates websites, please make a pull request.”
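
The "script that generates random websites" idea from the quote could start on the DSL side, by emitting random layouts that a separate compiler or template step later renders to HTML and screenshots. The following is only a rough sketch under that assumption. The token names (`col-6`, `image`, `text` and so on) and the function `random_layout` are illustrative, not the project's actual vocabulary or code.

```python
import random

# Hypothetical leaf tokens; replace with the vocabulary you actually use.
LEAVES = ["image", "text", "main-heading"]


def random_layout(rng, max_rows=3):
    """Build one random .gui-style layout as a string.

    rng is a random.Random instance, so a fixed seed reproduces a layout.
    """
    lines = ["header{", "    navigation-top{", "        logo", "    }", "}"]
    for _ in range(rng.randint(1, max_rows)):
        n_cols = rng.choice([1, 2, 3, 4])
        col = f"col-{12 // n_cols}"  # 12-column grid: col-12, col-6, col-4, col-3
        lines.append("row{")
        for _ in range(n_cols):
            lines += [f"    {col}{{", "        " + rng.choice(LEAVES), "    }"]
        lines.append("}")
    lines += ["footer{", "    row-centered{", "        text", "    }", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(random_layout(random.Random(0)))
```

Passing a seeded `random.Random` keeps the dataset reproducible: the same seed always yields the same layout, so a screenshot can be regenerated if it gets lost.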

@yuvarajvc

yuvarajvc commented Sep 5, 2019

Hi, thanks for sharing the data and code.
Can you please explain how to create .npz files and the corresponding .gui files for our custom images? If you have any thoughts, please share them with us; it would really help us proceed.
For example, I have attached a basic form image. Can you please share your thoughts on how to convert this image to .npz and .gui form to train the model, so that I can get the HTML code for similar images?
[Attached image: screentocode]

@salmanahmad10

Hi, I'm pretty late to this, but I was just wondering: what is a .gui file, and how do you open it?
Thank you

@PaulGwamanda

PaulGwamanda commented Jan 5, 2021

@yuvarajvc:
You can convert an image to a compressed .npz file using my script here: https://gist.github.com/PaulGwamanda/f91ce9fc9d392c4bcc99c085fd726a34
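
For readers who cannot open the gist, the general idea can be sketched as follows. This is not the gist's code. It assumes Pillow and NumPy are installed, a 256x256 RGB input, pixel values scaled to [0, 1], and an array stored under the key `features`; check what your own training loader expects before relying on any of these.

```python
import numpy as np


def to_features(pixels):
    """Scale uint8 pixel values to float32 in the range [0, 1]."""
    return np.asarray(pixels, dtype=np.float32) / 255.0


def save_npz(features, out_path):
    """Write the array to a compressed .npz file under the key 'features'.

    The key name is an assumption; use whatever your loader reads.
    """
    np.savez_compressed(out_path, features=features)


def image_to_npz(image_path, out_path, size=(256, 256)):
    """Load an image, resize it, and store it as a compressed .npz file."""
    from PIL import Image  # Pillow, imported here so the helpers above need only NumPy

    with Image.open(image_path) as img:
        pixels = np.asarray(img.convert("RGB").resize(size))
    save_npz(to_features(pixels), out_path)


# Example call (file names are placeholders):
# image_to_npz("image1.png", "image1.npz")
# features = np.load("image1.npz")["features"]
```

Note that this only covers the image half of a training pair. The matching .gui file still has to be written by hand or generated, since it is the label the model learns to predict.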

@salmanahmad10:
Any code editor can view and edit a .gui file.

The .gui file extension is a naming convention from the original paper (pix2code) and has no special significance. The project uses the .gui file to map a token sequence to its image pair, which has the same name.

i.e. image1.png (or .npz when compressed) should have a corresponding .gui file called image1.gui, which holds the textual tokens describing the image.
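
Since the pairing relies purely on matching file names, it is worth checking a dataset folder for images without a .gui partner before training. A minimal sketch, assuming all files sit in one flat directory; `pair_samples` is a made-up helper name, not part of the project.

```python
from pathlib import Path

# Image side of a pair: a raw screenshot or its compressed array.
IMAGE_SUFFIXES = {".png", ".npz"}


def pair_samples(data_dir):
    """Match each image file with the .gui file of the same base name.

    Returns (pairs, missing): pairs is a list of (image, gui) paths,
    missing lists the images that have no .gui file.
    """
    data_dir = Path(data_dir)
    pairs, missing = [], []
    images = sorted(p for p in data_dir.iterdir() if p.suffix in IMAGE_SUFFIXES)
    for image in images:
        gui = image.with_suffix(".gui")  # image1.png -> image1.gui
        if gui.exists():
            pairs.append((image, gui))
        else:
            missing.append(image)
    return pairs, missing


# Example call (directory name is a placeholder):
# pairs, missing = pair_samples("data/train")
```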

PS I'm pushing my dev toolkit here which includes 100 *samples and will be happy to sell my whole dataset. Email me at [email protected]
