
Conversation

@teonbrooks
Member

An attempt at fixing the overflow issue in the read_raw_cnt reader. This error manifested with a numpy upgrade.

Reference issue

Fixes #13547.

What does this implement/fix?

This follows a pattern suggested in #12907 to cast the integer to int64.
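For reference, the overflow pattern and the cast, as a minimal sketch (the values and names are illustrative, not the reader's actual code):

```python
import numpy as np

# CNT header fields are 32-bit; products of them can exceed the int32 range.
n_samples = np.int32(20_000_000)
n_channels = np.int32(66)

# int32 * int32 stays int32 and wraps on overflow (at most a RuntimeWarning,
# never an error):
# total_bytes = n_samples * n_channels * 4

# The pattern from #12907: promote to int64 before multiplying.
total_bytes = np.int64(n_samples) * n_channels * 4  # exact
```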

@larsoner
Member

To read your file it needs a few more fixes actually... I'll push

@larsoner
Member

Definitely still something wrong here...

```
$ python -uic "import mne; raw = mne.io.read_raw_cnt('~/Desktop/945flankers_ready.cnt', data_format='int16').load_data(); raw.plot(annotation_regex='aaa')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import mne; raw = mne.io.read_raw_cnt('~/Desktop/945flankers_ready.cnt', data_format='int16').load_data(); raw.plot(annotation_regex='aaa')
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "<decorator-gen-190>", line 12, in load_data
  File "/home/larsoner/python/mne-python/mne/io/base.py", line 589, in load_data
    self._preload_data(True)
    ~~~~~~~~~~~~~~~~~~^^^^^^
  File "/home/larsoner/python/mne-python/mne/io/base.py", line 601, in _preload_data
    self._data = self._read_segment(data_buffer=data_buffer)
                 ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<decorator-gen-189>", line 12, in _read_segment
  File "/home/larsoner/python/mne-python/mne/io/base.py", line 420, in _read_segment
    data = _allocate_data(data_buffer, data_shape, dtype)
  File "/home/larsoner/python/mne-python/mne/io/base.py", line 2577, in _allocate_data
    data = np.zeros(shape, dtype)
numpy._core._exceptions._ArrayMemoryError: Unable to allocate 2.06 TiB for an array with shape (66, 4294966564) and data type float64
```

Same error if I use data_format='int32'. If I remove the .load_data and use data_format='int32', the plot at least looks okay:

[screenshot: raw.plot() output, which looks reasonable]

So we need to figure out the n_samples issue; 4294966564 samples for 66 channels is totally unreasonable for a 150 MB file...
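Worth noting: 4294966564 is exactly 2**32 - 732, consistent with a small negative 32-bit value being reinterpreted as unsigned; a quick sanity check:

```python
import numpy as np

# 4294966564 == 2**32 - 732, i.e. the bit pattern of -732 read as a uint32:
print(2**32 - 4294966564)                      # 732
print(np.uint32(4294966564).astype(np.int32))  # -732
```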

@larsoner
Member

@teonbrooks I'm done pushing/looking for now, I hope the changes I made help debugging a bit more. Something is wrong with n_samples here, it gets read as 4294966564 ...

@teonbrooks
Member Author

thanks @larsoner!

Came across this post; adding it here for future reference:
https://paulbourke.net/dataformats/eeg/

@teonbrooks
Member Author

> @teonbrooks I'm done pushing/looking for now, I hope the changes I made help debugging a bit more. Something is wrong with n_samples here, it gets read as 4294966564 ...

according to the link above, it looks like this is not an uncommon occurrence:

> Experience has shown that many (most) of the fields are not filled out correctly by the software. In particular, the best way to work out the number of samples is

it looks like n_samples should be calculated as:

nsamples = SETUP.EventTablePos - (900 + 75 * nchannels) / (2 * nchannels)

@larsoner
Member

Great, can you add some comments / links in the code for the next time we dig into this, and try the suggested fix?

@teonbrooks
Member Author

Added a note. After trying it out and looking more closely at the code, it looks as though the n_samples logic is already there, starting at https://github.com/mne-tools/mne-python/blob/main/mne/io/cnt/cnt.py#L339.

@teonbrooks
Member Author

I actually don't know what to do about the n_samples. It looks like the code is already trying its best to handle the data without knowing the data_format, and without the header having a reliable entry for it.

@larsoner
Member

@teonbrooks do you want me to take a look?

> It looks like the code is already trying its best to handle the data without knowing the data_format, and without the header having a reliable entry for it.

So we have two potential sources of truth:

  1. n_samples, which is 4294966564 for this dataset
  2. SETUP.EventTablePos - (900 + 75 * nchannels) / (2 * nchannels), which is presumably correct for this dataset (right?)

In main we assume (1) is going to be more correct, so we use it, incorrectly, for this file. Does that sound right?

If so, maybe we should prefer to use (2) if it's available, since it's more likely to be correct.

We could add some parameter to control which of these to prefer, too, if needed. We could even make it n_samples="computed" (new default; option 2 above) | "read" (default on main; option 1 above) and change the default without a deprecation cycle, since I think we can consider this a bugfix given the unreliability of "read".
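At the call site, that proposal could look something like this (hypothetical parameter, not an existing API):

```python
import mne

# n_samples="computed"|"read" is the proposal above, not an existing option:
raw = mne.io.read_raw_cnt(
    "~/Desktop/945flankers_ready.cnt",
    data_format="int32",
    n_samples="computed",  # derive the count from event_table_offset
)
```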

@teonbrooks
Member Author

@larsoner, yes, that would be a great help if you could take a look at it. And yep, I agree with your assessment of the number of samples.

@larsoner
Member

larsoner commented Jan 9, 2026

Yikes this is a bit of a nightmare. Looks like in the header for event_table_offset and n_samples:

  1. Sometimes they are both correct, and consistent
  2. Sometimes event_table_offset is correct and n_samples is incorrect (like in the linked docs)
  3. Sometimes event_table_offset is incorrect and n_samples is correct (like in some of our test datasets)
  4. event_table_offset will always be wrong for file sizes > 2 GB

So when I said before:

> If so, maybe we should prefer to use (2) [event_table_offset-computed value] if it's available, since it's more likely to be correct.

I'm no longer convinced this is a good idea, given it's not even the case for our test datasets! To make things worse, all of this stuff interacts with data_format, which we allow to be "auto". So thinking about it more, how about we:

  1. Add n_samples="header" (default) | "computed" where "computed" means "compute it using event_table_offset and data_format"
  2. Only allow data_format="auto" when n_samples="header"; for n_samples="computed", require an explicit data_format and a file size < 2 GB. You can't figure out both data_format and n_samples given just an event_table_pos (and event_table_pos is only even potentially usable for file sizes < 2 GB) -- you can just as easily say there are 2 bytes per sample and X samples as 4 bytes per sample and X//2 samples. See the sketch below.
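A minimal sketch of that validation plus the "computed" path (all names here are hypothetical, not the reader's actual internals):

```python
# Hypothetical sketch of the proposed logic; not cnt.py's actual code.
N_BYTES = {"int16": 2, "int32": 4}

def _compute_n_samples(event_table_pos, n_channels, data_format, file_size):
    """Derive n_samples from event_table_pos instead of trusting the header."""
    if data_format == "auto":
        # event_table_pos alone cannot disambiguate 2 vs 4 bytes per sample
        raise ValueError('n_samples="computed" requires an explicit data_format')
    if file_size >= 2**31:
        # event_table_pos is a 32-bit header field, so it is unusable past 2 GB
        raise ValueError('n_samples="computed" cannot be used for files >= 2 GB')
    n_bytes = N_BYTES[data_format]
    data_start = 900 + 75 * n_channels  # SETUP block + per-channel ELECTLOC blocks
    return (event_table_pos - data_start) // (n_channels * n_bytes)
```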

Also, looking at the docs from https://paulbourke.net/dataformats/eeg/, their example file has the data offset (SETUP+ELECTLOC) at 5550, which runs to EventTablePos=8511950 for 62 channels. They say this span is n_samples * n_channels * 2 (so data_format="int16"). But if you take their math at the top, nsamples = SETUP.EventTablePos - (900 + 75 * nchannels) / (2 * nchannels), you get a very wrong value: 8511950 - (900 + 75*62) / (2*62) = 8511905.241935484. Correcting their algebra to what I think should be right, (EventTablePos - (900 + 75*n_channels)) / (n_channels*n_bytes), we get a more reasonable 68675.0 (note the whole value on this float division, which is good and suggests possible correctness once we convert to integer arithmetic!)... but according to the doc itself the number of samples is 68600! So somehow there are 75 extra values here.

I wondered if this could be the source of #11802, but I also see it in the data @teonbrooks shared -- computing the number of samples from the event_table_offset, I see ~1865 samples toward the end of the data that are almost all (but not all!) zeros. I wouldn't be surprised if this is the result of something writing un-zeroed malloc'ed rather than calloc'ed data... in any case, we should revisit #11802 once we work out the solutions above, since maybe we'll magically fix that issue, too.
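To make the precedence issue concrete, here is the doc's example run both ways (a quick sketch using the numbers quoted above):

```python
# The Bourke doc's example values, as quoted above.
event_table_pos = 8511950
n_channels = 62
n_bytes = 2  # "int16"

# The formula as written in the doc: division binds tighter than subtraction,
# so only the header size gets divided, and the result is nonsense.
as_written = event_table_pos - (900 + 75 * n_channels) / (2 * n_channels)
print(as_written)  # 8511905.241935484, clearly not a sample count

# With the outer parentheses added, the whole data span is divided by the
# bytes per sample across all channels; the result is a whole-valued float.
corrected = (event_table_pos - (900 + 75 * n_channels)) / (n_channels * n_bytes)
print(corrected)  # a whole number, as noted above
```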

@withmywoessner you've worked on this stuff a bit recently... WDYT?

xref previous nightmares #6535 #6537 #11802 #12393

@larsoner larsoner added this to the 1.12 milestone Jan 9, 2026
@teonbrooks
Member Author

> But if you take their math at the top, nsamples = SETUP.EventTablePos - (900 + 75 * nchannels) / (2 * nchannels), you get a very wrong value: 8511950 - (900 + 75*62) / (2*62) = 8511905.241935484. Correcting their algebra to what I think should be right, (EventTablePos - (900 + 75*n_channels)) / (n_channels*n_bytes), we get a more reasonable 68675.0

This was exactly where I got stumped as well! Trying to understand this gave me a massive headache 🤕

For setting the data_format for files >2GB, would this be a matter of the user having to try out both options to see if the data makes sense?

@larsoner
Member

larsoner commented Jan 9, 2026

> For setting the data_format for files >2GB, would this be a matter of the user having to try out both options to see if the data makes sense?

Yeah, I think so. It's not great, but I think it's probably better that they be explicit. Hopefully it's pretty obvious by eye; your data looked completely wrong for int16 but reasonable for int32, for example.
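For example (a sketch; assumes the file path from earlier in the thread):

```python
import mne

# For a large file where data_format can't be inferred, try each format and
# eyeball which scaling looks physiological; the wrong one usually looks
# like noise or wildly discontinuous traces.
for fmt in ("int16", "int32"):
    raw = mne.io.read_raw_cnt("~/Desktop/945flankers_ready.cnt", data_format=fmt)
    raw.plot(title=f"data_format={fmt}", block=True)
```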
