The abstract is built as follows:
`{title} {label}: {abstract.label}`
Mismatched offsets in 7 examples, all others pass
Some concrete examples of strange abstract creation: in Example 1, the start and end offsets match up if you include the word "Title", but in Example 2 they match up only if you exclude the word "Title".
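A quick way to see which construction a given example expects is to build both variants and test the annotation offsets against each. The snippet below is only a sketch: the helper names, the `"Title: "` prefix, and the toy title, body, and annotations are all made up for illustration and are not taken from the dataset or the dataloader.

```python
from typing import Dict, List, Tuple

# (start, end, expected mention text); a hypothetical, simplified annotation shape
Annotation = Tuple[int, int, str]


def offsets_match(text: str, annotations: List[Annotation]) -> bool:
    """Return True if every annotated span slices out its expected mention."""
    return all(text[start:end] == mention for start, end, mention in annotations)


def matching_variants(variants: Dict[str, str], annotations: List[Annotation]) -> List[str]:
    """Return the names of the candidate texts whose offsets all line up."""
    return [name for name, text in variants.items() if offsets_match(text, annotations)]


# Toy data (assumed, not from the real corpus)
title = "Aspirin and headache"
body = "Aspirin reduces pain."
variants = {
    "with_title_word": f"Title: {title} {body}",
    "without_title_word": f"{title} {body}",
}

example_1 = [(7, 14, "Aspirin")]  # offsets computed with the word "Title" included
example_2 = [(0, 7, "Aspirin")]   # offsets computed with the word "Title" excluded

print(matching_variants(variants, example_1))
print(matching_variants(variants, example_2))
```

Running a check like this over all examples would show whether the 7 mismatches fall into one of the two patterns or need a different construction.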
@phlobo What do we want to do with this dataset? It contains only the annotations, not the abstracts/texts. The texts could be downloaded via the API; however, there might be a lot of offset errors due to changed content, etc.
Would it be an option to include the abstracts (e.g., as a zip file) as part of the repo? I guess there are other datasets (MedMentions comes to mind) that redistribute PubMed abstracts as part of a GitHub repo.
Note: This dataset has a few issues
Is there a standard way these abstracts are formed?
- `biodatasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_BIGBIO_VERSION` variables.
- `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- `BUILDER_CONFIGS` class attribute is a list with at least one `BigBioConfig` for the source schema and one for a bigbio schema.
- `datasets.load_dataset` function.
- `python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py`.
- Note