Merged
15 changes: 11 additions & 4 deletions IDEAS.md
@@ -8,7 +8,7 @@
### StructuredOutput
- The HuggingFaceProvider and OllamaProvider are currently implemented but very brittle for Structured Output.
- They often follow the specified JSON schema but not always.
- Options: remove brittle providers, fix the providers with support from Ollama and HF dev, implement a more robust solution for constraining outputs like outlines (where possible)
- Options: remove brittle providers, fix the providers with support from Ollama and HF dev, implement a more robust solution for constraining outputs using techniques like outlines (where possible)
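As a rough illustration of a stopgap short of true constrained decoding, the sketch below parses and checks each response and retries on failure. It is not how outlines works (outlines constrains generation itself), and `parse_with_retry`, `generate`, and `required_keys` are hypothetical names, not part of the existing providers.

```python
import json
from typing import Callable


def parse_with_retry(
    generate: Callable[[], str],
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Call `generate` until it returns a JSON object with all required keys.

    `generate` is a hypothetical stand-in for a provider call that returns raw text.
    """
    last_error: Exception | None = None
    for _ in range(max_attempts):
        raw = generate()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc  # not valid JSON, try again
            continue
        if isinstance(data, dict) and required_keys.issubset(data):
            return data
        last_error = ValueError("response is not an object with all required keys")
    raise ValueError(f"no valid output after {max_attempts} attempts") from last_error
```

This only checks for top-level keys. A fuller version would validate against the actual schema, and retries cost extra calls, which is why constraining the output at generation time is the more robust option.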

### Include validation of user-defined prompt
- For each dataset type there are possibly different mandatory variables to inject (and to define)
@@ -26,7 +26,7 @@
- Right now the dataset is consolidated at the end of the generation process.
- This is terrible because we have to wait for the end to inspect and save it.
- With this we lose time, money, and potentially the dataset if something goes wrong.
- Each line should be written as soon as it is generated.
- Each data row should be written to the output file as soon as it is generated.
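A minimal sketch of what writing each row as soon as it is generated could look like, assuming a JSONL output file and rows represented as plain dicts. `write_rows_incrementally` is a hypothetical helper, not an existing datafast function.

```python
import json
from pathlib import Path
from typing import Iterable


def write_rows_incrementally(rows: Iterable[dict], path: Path) -> int:
    """Append each row to a JSONL file as soon as it is produced.

    Returns the number of rows written. Rows written before an error
    in `rows` remain on disk.
    """
    count = 0
    with path.open("a", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            f.flush()  # push the row out of Python's buffer right away
            count += 1
    return count
```

If `rows` is a generator that wraps the LLM calls, a failure partway through leaves every earlier row in the file, so the partial dataset can still be inspected or resumed.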

### Generate sample
- It might be useful to generate a sample of the dataset to inspect it before the full generation process.
@@ -45,14 +45,21 @@
- Check what options LiteLLM offers. This is another reason to migrate to LiteLLM.

### Reasoning Dataset
- Simple DeepSeek distilation?
- Simple DeepSeek distillation?
- Agentic Generation?
- We can have a look at CamelAI, but also check how other reasoning datasets are generated.
- First I would like to switch to LiteLLM for inference and test some approaches. Then we can work on it.

### Tests
- There are no proper tests, just some scripts within `examples`.
- We should add proper tests soon after launch.

### EvolInstruct not implemented
- At the moment EvolInstruct is not implemented in PreferenceDataset Generation.
- While this is not critical at this stage, it would be great to have it soon.
- While this is not critical at this stage, it would be great to have it soon.


### Need to align on the naming across the codebase
- For instance, `prompt` vs `query` vs `instruction` should be clarified.
- If specifying `Text` in front of `TextClassificationDataset` is relevant, why don't we use similar prefixes with other datasets?
- Since datafast will most likely only address Text datasets, we can probably drop the `Text` specifier.