To address the wide range of questions posed by users, LVLMs need to possess diverse capabilities. For a comprehensive evaluation, we curate 61 benchmark datasets from existing resources, summarizing the assessed capabilities into 2 major categories and 8 sub-categories.
Assessed capability dimensions and tasks in ReForm-Eval ("Desc" and "Classif." are short for description and classification).
To make our benchmark easier to use, we further convert the formulated datasets and store the images as base64 strings. We call this converted dataset ReForm-Eval-Data; it is uploaded to the Hugging Face Hub and Google Drive. You can load the dataset directly from our Hugging Face repository or from a local path, avoiding manual downloading, so this is the recommended method.
Please set --hf or --offline_hf if you would like to load ReForm-Eval-Data when using our framework (run_eval.py and run_loader_eval.py). --hf loads from the Hugging Face Hub, while --offline_hf loads ReForm-Eval-Data from a local path. If both are set, the data will be loaded from the Hugging Face Hub.
Please set load_from_hf=True or offline_from_hf=True if you would like to load ReForm-Eval-Data when using the Data Loader (from build import load_reform_dataset). load_from_hf=True loads from the Hugging Face Hub, while offline_from_hf=True loads ReForm-Eval-Data from a local path. If both are set to True, the data will be loaded from the Hugging Face Hub.
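For example, a minimal sketch of calling the Data Loader with these flags might look as follows; any arguments other than load_from_hf / offline_from_hf (such as which dataset and task to build) are omitted here and depend on your setup.

from build import load_reform_dataset

# Load the formulated data from the Hugging Face Hub.
# NOTE: only load_from_hf / offline_from_hf are taken from the description
# above; the dataset/task selection arguments required by the loader are
# omitted in this sketch and depend on your configuration.
dataset = load_reform_dataset(
    load_from_hf=True,       # load ReForm-Eval-Data from the Hugging Face Hub
    # offline_from_hf=True,  # or: load from a local ReForm-Eval-Data copy
)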
In our repository, huggingface_data is the relative path in the Hugging Face Hub; it is configured in /path/to/ReForm-Eval/build/configs/DisasterType_val.yaml as shown below:
dataset: 'MEDIC'
task: 'dts' # disaster type selection
data_config:
  load_from_bootstrap: True
  image_path: "/remote-home/share/multimodal-datasets/raw_datasets/MEDIC/data"
  medic_path: "/remote-home/share/multimodal-datasets/Gen_Eval/Disaster-Type-Selection/disaster-type-selection-sampled.json"
  huggingface_data: "huggingface_data/MEDIC/disaster-type-selection-sampled.json" # the relative path in Hugging Face Hub
  offline_huggingface_data: "ReForm-Eval-Data/huggingface_data/MEDIC/disaster-type-selection-sampled.json" # the relative local path of Hugging Face data
In /path/to/ReForm-Eval/build/MEDIC/disaster_type_dataset.py, the data on the Hugging Face Hub is loaded directly from the path in the config file using the load_dataset function, so no changes are needed.
if args.hf:
    data = load_dataset("Aweminus/ReForm-Eval-Data", data_files={'test': self.config['data_config']['huggingface_data']}, split='test')
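For reference, the corresponding local-path branch can be expected to follow the same pattern, reading offline_huggingface_data instead; the sketch below is based on that config field and the json loading shown later in this document, not a verbatim copy of the repository code.

elif args.offline_hf:
    # load the same formulated data from a local copy of ReForm-Eval-Data
    data = load_dataset("json", data_files={'test': self.config['data_config']['offline_huggingface_data']}, split='test')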
If you cannot access Hugging Face, you can use the following commands to download the dataset and then load it locally.
Via git clone:
git lfs install
git clone https://huggingface.co/datasets/Aweminus/ReForm-Eval-Data
Or download from Google Drive:
download URL: https://drive.google.com/file/d/1GjWvm0f6fkJ7VFySKyEfb2N_KyZxcdyI/view
wget: wget "https://drive.google.com/uc?export=download&id=1GjWvm0f6fkJ7VFySKyEfb2N_KyZxcdyI"
When you git clone the dataset or place the ReForm-Eval-Data folder in the root directory of this repository, offline_huggingface_data does not need to be modified.
|-- ReForm-Eval
    |-- ReForm-Eval-Data
        |-- huggingface_data
            |-- A-OKVQA
            |-- A-OKVQAR
            ...
    |-- build
    |-- commands
    |-- metrics
    |-- models
    ...
Otherwise, offline_huggingface_data needs to be modified:
dataset: 'MEDIC'
task: 'dts' # disaster type selection
data_config:
  load_from_bootstrap: True
  image_path: "/remote-home/share/multimodal-datasets/raw_datasets/MEDIC/data"
  medic_path: "/remote-home/share/multimodal-datasets/Gen_Eval/Disaster-Type-Selection/disaster-type-selection-sampled.json"
  huggingface_data: "huggingface_data/MEDIC/disaster-type-selection-sampled.json" # the relative path in Hugging Face Hub
  offline_huggingface_data: "ReForm-Eval-Data/huggingface_data/MEDIC/disaster-type-selection-sampled.json" # the place you may need to modify (the relative local path of Hugging Face data)
If you are interested in exactly how we formulate the data and want to inspect the ReForm-Eval-Data JSON files, use the following code:
from datasets import load_dataset
# You can pass field="data" as a parameter to extract the "data" field directly.
# Load from the Hugging Face Hub
dataset = load_dataset("Aweminus/ReForm-Eval-Data",data_files={'test':'huggingface_data/MEDIC/disaster-type-selection-sampled.json'}, split='test')
# Load from the local path
dataset = load_dataset("json",data_files={'test':'/path/to/disaster-type-selection.json'}, split='test')
If you intend to check out one sample of our formulated data, in most cases you should add [0] between dataset and ['data'], which is different from json.load.
dataset = dataset[0]['data'][n] # n: the n-th sample you want to check out
However, for a few JSON files of some datasets such as TDIUC, you do not need to specify the ['data'] field or add [0] before it; you can index a sample directly.
dataset = dataset[n] # n: the n-th sample you want to check out
We store the images as base64 strings in all JSON files. These strings are decoded back into complete images by PIL when the dataset is built.
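For reference, decoding such a base64 string back into an image with PIL follows the standard pattern below; the key name 'image' used for the encoded string is an assumption for illustration and may differ across datasets.

import base64
from io import BytesIO
from PIL import Image

# `sample` is one entry of the formulated data, e.g. dataset[0]['data'][n] above;
# the 'image' key holding the base64 string is an assumed name for illustration
image_bytes = base64.b64decode(sample['image'])
image = Image.open(BytesIO(image_bytes)).convert('RGB')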
Alternatively, all datasets are also provided with URLs so that you can download them manually. After downloading all datasets, you need to modify the following paths in the config files under PATH_TO_REFORM-EVAL/datasets/configs/.
dataset: 'MEDIC'
task: 'dts' # disaster type selection
data_config:
  load_from_bootstrap: True
  image_path: "/remote-home/share/multimodal-datasets/raw_datasets/MEDIC/data" # The place you need to modify
  medic_path: "/remote-home/share/multimodal-datasets/Gen_Eval/Disaster-Type-Selection/disaster-type-selection-sampled.json" # The place you need to modify
  huggingface_data: "huggingface_data/MEDIC/disaster-type-selection-sampled.json"
  offline_huggingface_data: "ReForm-Eval-Data/huggingface_data/MEDIC/disaster-type-selection-sampled.json"
We also provide the raw JSON files, such as the one pointed to by "medic_path".
download URL: https://drive.google.com/file/d/1D4CH9_RJKoCGFqDy5eIhG7h-ZRllgSfc/view
wget: wget "https://drive.google.com/uc?export=download&id=1D4CH9_RJKoCGFqDy5eIhG7h-ZRllgSfc"
For multi-round VQA tasks, unlike VisDial, which performs offline multi-round dialogue (using the ground truth in the dialogue history), we consider online multi-round dialogue (using the model's previous outputs in the dialogue history).
In our framework, we use the "--online_multi_round" parameter to indicate this setting.
If you perform online multi-round dialogue without our framework, you need to be careful to update the history in the dataset during the evaluation. Here is an example of this procedure in our framework (in run_eval.py):
def get_pred_result(samples, prediction, metric):
    history_result = []
    # iterate through the prediction batch
    for i in range(len(prediction)):
        # detect whether the prediction matches an option
        correct, final_pred = metric(prediction[i], samples['answer'][i], samples['answer_options'][i])
        if final_pred is None:
            # if the prediction does not match any option, keep it as-is
            final_pred = prediction[i]
        else:
            # otherwise, map the prediction back to the original option
            try:
                final_pred = samples['answer_options'][i][final_pred]
            except:
                print('found invalid prediction: {}'.format(prediction[i]))
                final_pred = prediction[i]
                # raise ValueError
        history_result.append([samples['sample_id'][i], final_pred])
    return history_result
for batch in tqdm.tqdm(dataloader, desc='evaluating'):
    # the inference process
    if args.infer_method == 'generation':
        res = model(batch['image'], batch['text'], **generation_kwargs)
    else:
        res = model(batch['image'], batch['text'], batch['answer_options'], **likelihood_kwargs)
    # get the prediction from the output
    generated_history_infos = get_pred_result(batch, res, metric)
    # gather all predictions from all gpus
    gathered_history = [i for i in range(args.n_gpus)]
    dist.all_gather_object(gathered_history, generated_history_infos)
    # make the update to the dialog history in the dataset
    """
    gathered_history: List[List[str, str]], each element is a list of the sample_id and prediction result.
    Here is an example with dataset_duplication=5, after our model finishes the prediction for the first round of the VisDial_00 sample.
    >>> gathered_history
    [['VisDial_00_round0', 'yes'], ['VisDial_00_round0', 'no'], ['VisDial_00_round0', 'yes'], ['VisDial_00_round0', 'no'], ['VisDial_00_round0', 'yes']]
    """
    dataset.update_history(gathered_history)
Then in the next round, the history will be updated with the prediction "yes".
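If you implement online multi-round dialogue outside our framework, the dataset-side update could conceptually look like the sketch below; the internal storage and the handling of duplicated samples are simplified assumptions, not the actual update_history implementation in ReForm-Eval.

from collections import defaultdict

class MultiRoundDatasetSketch:
    def __init__(self):
        # sample_id -> predictions from previous rounds (assumed storage layout)
        self.history = defaultdict(list)

    def update_history(self, gathered_history):
        # gathered_history: [sample_id, prediction] pairs, as in the example above
        for sample_id, prediction in gathered_history:
            # append the current-round prediction so that the prompt of the
            # next round can include it in the dialogue history
            self.history[sample_id].append(prediction)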
Also notice that "round_id" should be included in the output JSON. During evaluation, use "--multi_round_eval" to evaluate the relationship between model performance and the number of dialogue rounds.
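As a rough illustration of what such a per-round analysis computes (not the actual implementation behind --multi_round_eval), accuracy can be grouped by round_id in the output records; the field names below are assumptions.

from collections import defaultdict

def accuracy_per_round(records):
    # records: output entries, each assumed to contain a 'round_id' and a
    # boolean 'correct' field (field names are illustrative assumptions)
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r['round_id']] += 1
        hits[r['round_id']] += int(r['correct'])
    return {rid: hits[rid] / totals[rid] for rid in sorted(totals)}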