05-28-2023, 06:39 PM
Spliff
Re CSV Import

Having now run extensive tests of the CSV import with "real" data, I have found that the tree replication is not without fault. Example (indentation levels):

1 ok (the unique source element; the target element in UR being considered 0)
2 ok
3 ok
2 stays at 3 (instead of being moved "up" again, with or without content; btw, I systematically put the tab after the title, even when there is no content, so that every CSV "record" has the same column count)
1 (which is logically wrong, since level 1 should occur only once, but I changed the element to be imported in order to "see what it brings"): this one does go up, but only to 2. So this "trick" might help to preserve the tree at least a little better, but manual adjustments are currently needed here in any case.

Another example (first the original indentation, then UR's):
1 1 ok (UR target is 0)
2 2
3 3
2 3!
3 4!
2 3!
3 4!
3 4!
2 3!
2 3!

So much for the problem I discovered. It seems there is a code problem such that the level directly beneath the source level of the "import data set" can't be reached anymore from "below". In other words, if you don't count as I do (existing UR target = 0) but instead count the source item of the tree to be imported as 0, then level 1 can't be reached anymore from level 2. That should be easy to detect then. ;-)
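
For comparison, here is a minimal sketch of how a list of indent levels is commonly turned into a tree: keep a stack of "open" ancestors and pop it whenever the level goes back up. This is NOT UR's actual import code, which is unknown here; parents_from_levels is a hypothetical helper, only meant to show which parent each record of the second example should end up with.

```python
def parents_from_levels(levels):
    """Return, for each record, the index of its parent record.

    None means "directly under the existing UR target item".
    Hypothetical helper for illustration, not UR's algorithm.
    """
    parents = []
    stack = []  # (level, record_index) pairs for the chain of open ancestors
    for index, level in enumerate(levels):
        # Going "up" in the tree: drop everything at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parents.append(stack[-1][1] if stack else None)
        stack.append((level, index))
    return parents


# Original indentation of the second example above
print(parents_from_levels([1, 2, 3, 2, 3, 2, 3, 3, 2, 2]))
# [None, 0, 1, 0, 3, 0, 5, 5, 0, 0]
```

With this logic every "2" comes back under the first record (index 0), which is what the source data asks for and what the import currently does not deliver.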


When the schema is correct, the user can freely use CRLF (see above) both as the "new CSV row" (i.e. new record) separator AND as the newline within the content field (within "..." of course). Using CRLF for the one and LF for the other is of no practical interest/value.
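
A minimal sketch of producing such a file with Python's standard csv module, assuming the three columns indentlevel, itemtitle and itemtext, tab as field separator, and CRLF everywhere. write_import_csv is a hypothetical helper name; whether UR accepts every detail of this output is an assumption, not something tested here.

```python
import csv
import io


def write_import_csv(rows):
    """rows: iterable of (indentlevel, itemtitle, itemtext) tuples."""
    buf = io.StringIO(newline="")  # no newline translation, CRLF stays CRLF
    writer = csv.writer(
        buf,
        delimiter="\t",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quotes only fields that need it, e.g. with CRLF inside
        lineterminator="\r\n",
    )
    for level, title, text in rows:
        # The third column is always written, even when empty,
        # so every record has the same column count.
        writer.writerow([level, title, text])
    return buf.getvalue()


sample = write_import_csv([
    (1, "Root", "line one\r\nline two"),
    (2, "Child", ""),
])
print(repr(sample))
# '1\tRoot\t"line one\r\nline two"\r\n2\tChild\t\r\n'
```

The content field with the embedded CRLF gets its quotes automatically, and the record without content still ends with the tab after the title.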

But the user should be reminded that both UR's tree and content are ANSI, not UTF-8. So, in order for titles and content (etc.) to be rendered correctly, they must change their CSV file's code page before import (I have even encountered failed imports, with multiple empty "New Text" items being created, when trying to import in UTF-8 format). Almost any editor can do that, even Windows' native "Notepad": it shows the current code page in its status bar, and to change it if necessary, use "File - Save As"; that dialog will then offer to change the "Encoding".
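
For many files, the same re-encoding can be scripted. A minimal sketch, assuming that "ANSI" means Windows code page 1252 (Western European; other systems use a different code page) and that characters without an ANSI equivalent may be replaced by "?". utf8_to_ansi, convert_file and the file names are hypothetical.

```python
from pathlib import Path


def utf8_to_ansi(data: bytes, codepage: str = "cp1252") -> bytes:
    """Re-encode UTF-8 bytes to an ANSI code page (cp1252 is an assumption)."""
    text = data.decode("utf-8-sig")  # utf-8-sig also drops a leading BOM, if any
    # Characters that do not exist in the code page become "?"
    return text.encode(codepage, errors="replace")


def convert_file(src: str, dst: str) -> None:
    # Binary read/write so that the CRLFs are passed through untouched.
    Path(dst).write_bytes(utf8_to_ansi(Path(src).read_bytes()))


print(utf8_to_ansi("Café\r\n".encode("utf-8")))
# b'Caf\xe9\r\n'
```

Usage would be something like convert_file("bookmarks_utf8.csv", "import.csv"), with both file names being placeholders.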

Also, for every import of another file, the user must set up the respective import columns again, even when they always stay at indentlevel, itemtitle and itemtext. So for importing multiple files, it's advisable to rename each file to be imported to a common "dummy" file name, so that UR will preserve the target columns. As for the import's field separator (e.g. {tab}), you must re-select it every single time.

And finally: don't bother endlessly with "csv-enabled" editors and their possibly endless claims that your file is faulty CSV. Just use any "dumb" editor (such as the aforementioned Notepad or similar) and check your schema visually; newlines within fields are simply "too much" for some allegedly "csv-ready" editors (names withheld here...).


EDIT: My attempt to do away with the "" was a failure, then, since they are needed to distinguish the CRLFs used as newlines from the CRLFs used as row separators. In theory, using LFs vs. CRLFs might do away with that necessity, but I think that will be futile, too.


EDIT 2:

In order to check whether my numbering, starting at 1, was the culprit, I have run new tests, both starting at 1 and starting at 0, and the results are identical.

Starting with 0 (so the existing UR target (= parent) item would count as "-1"):

0 ok
1 ok
2 ok
3 ok; but now I go 2 up, not just 1:
1*: lands not at 1 but at 0 (!), and title/content are not preserved but replaced by "New Text"; it also creates a second item (2), with the title and content of the "1" item, and:
2**
1**
After the wrong "1*" and its unwanted "2" item described above, I get 4 new items instead of just 2 (the above "2**" and "1**"), oscillating between 0 and 1 (!): the "0" items are "New Text" ones, and the "1" items carry the titles and content meant for the empty "0" ones. There is no difference here between 2** and 1**, i.e. both become empty 0 items, with their title and content then at 1.

Obviously, I have checked and rechecked my schema, which is not at fault. EmEditor's "show all", for the "" and also for control characters, prominently displays all occurrences with a green background, so that in a short text it's not possible to overlook unwanted or missing quotes, tabs, or CRLFs (it also shows CRLFs and plain LFs with different symbols; I just use CRLFs now).
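
For longer files, where a visual check gets tedious, the schema can also be checked mechanically. A minimal sketch, assuming tab-separated records with exactly 3 columns and a numeric indent level in the first one; check_schema is a hypothetical helper and only covers these two rules (column count, and no level jumping down the tree by more than one step).

```python
import csv
import io


def check_schema(text, columns=3):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    # newline="" keeps CRLFs inside quoted fields intact for the csv reader
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t", quotechar='"')
    previous = None
    for number, record in enumerate(reader, start=1):
        if len(record) != columns:
            problems.append(f"record {number}: {len(record)} columns instead of {columns}")
            continue
        try:
            level = int(record[0])
        except ValueError:
            problems.append(f"record {number}: indent level {record[0]!r} is not a number")
            continue
        # A child may be at most one level deeper than the record before it.
        if previous is not None and level > previous + 1:
            problems.append(f"record {number}: level jumps from {previous} to {level}")
        previous = level
    return problems


print(check_schema('0\tA\t"x\r\ny"\r\n1\tB\t\r\n'))
# []
```

An empty result only says that the file follows the schema; it says nothing about hidden characters inside the fields.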

(In order to exclude any possible interplay with my running AHK script, I stopped that script; the faulty import results are exactly as before.)

I now tried the above (0123121) without content, leaving just the tabs (which without content don't make much sense, except for indicating that there is possible content, i.e. 3 columns instead of just 2): now it works as expected. So the existence of content makes the algorithm choke when "going up" in the tree (i.e. going down in indent level number).


EDIT 3

Doing more work currently, will post again in a few hours. UTF-8 to ANSI seems to be the culprit, in combination with EmEditor: purging ("save as" with a new code page) in EmEditor is obviously NOT sufficient, since a second purge (again "save", with the message "saving will lose characters") in (Windows') Notepad then IS sufficient; ditto for skipping EmEditor's purge and just doing ONE purge, in Notepad.

Apparently, EmEditor leaves special CONTROL chars within the "purged" data, which it does NOT display, neither before nor afterwards, although the UTF-8 format is always without (!) BOM, whilst Notepad really purges the UTF-8 into data that then works as expected, even at UR import. (I'm processing and checking numerous real-life files, alternatively also adding the indent-level number to the titles, so that it doesn't vanish at import and possible faults can thus easily be checked visually.)

It seems that within Firefox's (UTF-8) HTML bookmarks export (in my case 22,000 items), even in "simple"-looking excerpts of just a few bookmarks that have been correctly reformatted for UR import, there are always hidden control chars left over from the UTF-8 to ANSI conversion IF the re-encoding is done in EmEditor. Upon UR import, these scramble the import ONLY when, and whenever, the import goes UP in the tree hierarchy, near the "top" of the tree to be imported.
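
Since the editor doesn't display these characters, one way to look for them is to scan the raw bytes of the converted file. A minimal sketch, assuming the file is meant to be cp1252 "ANSI" and that only tab, CR and LF are legitimate control characters. Which bytes EmEditor actually leaves behind is NOT known here; find_suspicious_bytes is a hypothetical helper that just lists candidates.

```python
# Byte values that cp1252 leaves undefined; finding one may hint at a botched conversion.
UNDEFINED_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, LF, CR


def find_suspicious_bytes(data: bytes):
    """Return (offset, byte_value) pairs for unexpected control bytes."""
    hits = []
    for offset, byte in enumerate(data):
        is_control = (byte < 0x20 and byte not in ALLOWED_CONTROLS) or byte == 0x7F
        if is_control or byte in UNDEFINED_CP1252:
            hits.append((offset, byte))
    return hits


print(find_suspicious_bytes(b"1\tTitle\t\r\n"))
# []
print(find_suspicious_bytes(b"a\x00b\x81\r\n"))
# [(1, 0), (3, 129)]
```

Running this on the same excerpt once after the EmEditor conversion and once after the Notepad conversion, and comparing the two lists, should show whether leftover bytes of this kind are really the difference.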

As said, will post again in some hours.

Last edited by Spliff; 05-29-2023 at 05:31 AM.