Closes #427 by nomisto · Pull Request #428 · bigscience-workshop/biomedical

nomisto · 2022-04-12T08:12:05Z

Closes #427

The dataset contains 8 different subset_ids (different dataset settings), each with a bigbio and a source schema.

Furthermore, there is a subset called mediqa_ans_all which includes all data (articles, sections, URLs of documents, all four kinds of summaries, ...). I did not implement a bigbio schema for this all view, as I think it does not make sense here. Since the bigbio schema is missing for mediqa_ans_all, the tests fail for that subset.

Tests:

python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_all
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_page2answer_multi_abstractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_page2answer_multi_extractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_page2answer_single_abstractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_page2answer_single_extractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_section2answer_multi_abstractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_section2answer_multi_extractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_section2answer_single_abstractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_section2answer_single_extractive
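
The eight non-"all" subset_ids above follow a regular pattern (input granularity, single vs. multi, abstractive vs. extractive). As a minimal sketch, the test commands can be generated rather than typed out. The helper names `subset_ids` and `build_test_commands` are hypothetical and not part of the repository; only the script path and the `--subset_id` flag are taken from the commands listed above.

```python
from itertools import product

# Path taken from the test commands in this PR description.
SCRIPT = "biodatasets/mediqa_ans/mediqa_ans.py"


def subset_ids():
    """Return mediqa_ans_all plus the 8 setting-specific subset_ids."""
    combos = product(
        ("page2answer", "section2answer"),
        ("multi", "single"),
        ("abstractive", "extractive"),
    )
    return ["mediqa_ans_all"] + ["mediqa_ans_" + "_".join(c) for c in combos]


def build_test_commands():
    """Build one test invocation per subset_id (hypothetical helper)."""
    return [
        f"python -m tests.test_bigbio {SCRIPT} --subset_id {subset}"
        for subset in subset_ids()
    ]


if __name__ == "__main__":
    for command in build_test_commands():
        print(command)
```

Printing the commands instead of running them keeps the sketch free of side effects; they can be pasted into a shell as needed.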

nomisto · 2022-04-12T08:23:13Z

biodatasets/mediqa_ans/mediqa_ans.py

+    def _source_to_t2t(self, example):
+        example_ = {}
+        example_["document_id"] = ""
+        example_["text_1_name"] = ""
+        example_["text_2_name"] = ""
+
+        text1 = ""
+        text1 += "Question ID: " + example["question_id"] + "\n"
+        text1 += "Question: " + example["question"] + "\n"
+        for article in example["articles"]:
+            text1 += "Answer ID: " + article["answer_id"] + "\n"
+            text1 += "Answer: " + article["text"] + "\n"
+            text1 += "Rating: " + article["rating"] + "\n"
+        example_["text_1"] = text1
+
+        example_["text_2"] = example["summary"]
+
+        return example_


This is the transformation of the source data to fit the t2t schema.
The summarization works as question + answer -> summarized_answer, so for the t2t schema I concatenated all relevant values, separated by "\n", into text_1.

An example from page2answer_single_abstractive:

"1_Answer4": { "summary": "Abetalipoproteimemia, also known as Bassen-Kornzweig syndrome, ... ", "articles": " Bassen-Kornzweig syndrome Abetalipoproteinemia Acanthocytosis Apolipoprotein B deficiency...", "question": "abetalipoproteimemia hi, I would like to know if there is any support for those suffering with abetalipoproteinemia? ...", "question_id": "1", "rating": "3-Incomplete" }

where "1_Answer4" is the answer_id used above and "articles" corresponds to article["text"].
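
To make the resulting text_1 layout concrete, here is a minimal standalone sketch of the same concatenation. It assumes the example has already been normalized so that "articles" is a list of dicts with "answer_id", "text" and "rating" keys, as the loop in `_source_to_t2t` expects. The function name `source_to_t2t` and all sample values are hypothetical placeholders, not real dataset content.

```python
def source_to_t2t(example):
    """Standalone copy of the concatenation logic, for illustration only."""
    lines = [
        "Question ID: " + example["question_id"],
        "Question: " + example["question"],
    ]
    for article in example["articles"]:
        lines.append("Answer ID: " + article["answer_id"])
        lines.append("Answer: " + article["text"])
        lines.append("Rating: " + article["rating"])
    return {
        "document_id": "",
        "text_1_name": "",
        "text_2_name": "",
        # Every field ends with "\n", matching the original method.
        "text_1": "\n".join(lines) + "\n",
        "text_2": example["summary"],
    }


# Placeholder record; the shape is an assumption, the values are made up.
sample = {
    "question_id": "1",
    "question": "placeholder question",
    "articles": [
        {
            "answer_id": "1_Answer4",
            "text": "placeholder answer text",
            "rating": "3-Incomplete",
        }
    ],
    "summary": "placeholder summary",
}

if __name__ == "__main__":
    print(source_to_t2t(sample)["text_1"])
```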

sunnnymskang

@nomisto In the description, can you add information about the subset_ids (and note that mediqa_ans_all implements only the source schema)? Confirmed that the other 8 subset_ids pass the unit tests.

nomisto · 2022-04-26T07:58:05Z

Hi @sunnnymskang, sure, I've added a description to the value of _DESCRIPTION and to the docstring.

hakunanatasha · 2022-04-27T04:46:39Z

@nomisto Can you remind me why this fits the t2t schema better than question answering? We want to merge this PR asap; it looks mostly ok.

nomisto · 2022-04-27T06:08:56Z

Hi @hakunanatasha, the name of this dataset is a little misleading: it is a summarization task, more specifically an answer summarization task. The input is question + answer, and the task is to generate a summary of that answer.

hakunanatasha · 2022-04-27T15:14:34Z

@nomisto got it; I'll merge this later today. Sorry for the holdup. I assume that since it's a summarization task, text_1_name and text_2_name are left blank because there is nothing to fill in here.

Initial mediqa ans dataset

c6f6df3

nomisto requested review from galtay, hakunanatasha, jason-fries, leonweber, ruisi-su, sg-wbi and sunnnymskang as code owners April 12, 2022 08:12

changed label 'Text' to 'Answer' in text_1

3090d06

reformat

3033925

sunnnymskang self-assigned this Apr 12, 2022

hakunanatasha self-assigned this Apr 25, 2022

sunnnymskang added and then removed the "tricky schema" label (bigbio schema doesn't fit this dataset easily) Apr 26, 2022

sunnnymskang requested changes Apr 26, 2022

Added description of subsets

b07671c

nomisto requested a review from debajyotidatta as a code owner April 26, 2022 07:57

nomisto wants to merge 4 commits into bigscience-workshop:main from nomisto:mediqa_ans