wrong database encodings #146

Open
MartinHinz opened this issue Aug 29, 2021 · 6 comments

@MartinHinz
Contributor

>>> file -i PA20110001_S01.txt                                     
PA20110001_S01.txt: text/plain; charset=iso-8859-1

This leads to "wrong" site names.
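
A minimal sketch of the effect in Python, not the package's own import code. The site name is a made-up example, not taken from the pacea file: bytes written as ISO-8859-1 cannot be read back cleanly as UTF-8.

```python
# Hypothetical site name with a non-ASCII character (assumption, not from pacea).
name = "Château"

# Bytes as they would be stored in an ISO-8859-1 file.
raw = name.encode("iso-8859-1")

# Reading those bytes as UTF-8 produces a replacement character.
wrong = raw.decode("utf-8", errors="replace")

# Reading them with the encoding reported by `file -i` restores the name.
right = raw.decode("iso-8859-1")

print(repr(wrong))
print(repr(right))
```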

nevrome added a commit that referenced this issue Aug 29, 2021
@nevrome
Member

nevrome commented Aug 29, 2021

Should be fixed in v2.4.2

@MartinHinz
Contributor Author

Thanks! Same is true for palmisano. Will report other errors as they appear.

@nevrome
Member

nevrome commented Aug 30, 2021

Palmisano will probably be superseded by aida (#144) soon. But keep them coming anyway. And feel free to make PRs right away.

@nevrome changed the title from "pacea is imported as utf8, but file seems to be iso-8859-1" to "wrong database encodings" on Sep 2, 2021
@nevrome
Member

nevrome commented Sep 2, 2021

Hm - ok - palmisano is not superseded by aida.

So what is the correct encoding of the radiocarbon.csv file in the palmisano db zip archive?

file -i radiocarbon.csv 
radiocarbon.csv: text/plain; charset=unknown-8bit

That's not helpful.

Not many site names are obviously affected. I only see two, so we could overwrite them manually.

Grotta dell�۪Orso
Osteria dell�۪Osa Necropolis

A lot of the citation strings are heavily broken, though.
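
A rough sketch of what the manual overwrite for the two site names could look like, in Python rather than in the package's own code. It assumes the broken run stands in for an apostrophe ("dell'Orso", "dell'Osa"), which is a guess from the Italian and not confirmed by the dataset author. `fix_site_name` is a hypothetical helper, and the broken characters are written as they render in this thread.

```python
import re

# Assumption: any run of non-ASCII characters between "dell" and a capital
# letter is a broken apostrophe.
BROKEN_APOSTROPHE = re.compile(r"(?<=dell)[^\x00-\x7f]+(?=[A-Z])")


def fix_site_name(name: str) -> str:
    """Replace the broken run with a plain ASCII apostrophe."""
    return BROKEN_APOSTROPHE.sub("'", name)


# The two affected names, with the broken characters as escape sequences.
broken = [
    "Grotta dell\ufffd\u06eaOrso",
    "Osteria dell\ufffd\u06eaOsa Necropolis",
]
fixed = [fix_site_name(n) for n in broken]
print(fixed)
```

This only covers the site names; the broken citation strings would need a different approach.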

@MartinHinz
Contributor Author

MartinHinz commented Sep 10, 2021

I have tried 228 encodings; none of them worked for the source in the first line (Skeates). I assume it is simply gibberish?

The following did not work:

  [1] "437"                 "850"                 "852"                 "855"                
  [5] "857"                 "860"                 "861"                 "862"                
  [9] "863"                 "865"                 "866"                 "869"                
 [13] "ARMSCII-8"           "ATARI"               "ATARIST"             "CP-GR"              
 [17] "CP-IS"               "CP1046"              "CP1124"              "CP1125"             
 [21] "CP1129"              "CP1133"              "CP1163"              "CP1250"             
 [25] "CP1251"              "CP1252"              "CP1254"              "CP1256"             
 [29] "CP1257"              "CP1258"              "CP154"               "CP437"              
 [33] "CP737"               "CP775"               "CP819"               "CP850"              
 [37] "CP852"               "CP853"               "CP855"               "CP857"              
 [41] "CP858"               "CP860"               "CP861"               "CP862"              
 [45] "CP863"               "CP864"               "CP865"               "CP866"              
 [49] "CP869"               "CP922"               "CP932"               "CP943"              
 [53] "CSHPROMAN8"          "CSIBM1163"           "CSIBM855"            "CSIBM857"           
 [57] "CSIBM860"            "CSIBM861"            "CSIBM863"            "CSIBM864"           
 [61] "CSIBM865"            "CSIBM866"            "CSIBM869"            "CSISOLATIN1"        
 [65] "CSISOLATIN2"         "CSISOLATIN3"         "CSISOLATIN4"         "CSISOLATIN5"        
 [69] "CSISOLATIN6"         "CSISOLATINCYRILLIC"  "CSKOI8R"             "CSMACINTOSH"        
 [73] "CSPC775BALTIC"       "CSPC850MULTILINGUAL" "CSPC862LATINHEBREW"  "CSPC8CODEPAGE437"   
 [77] "CSPCP852"            "CSPTCP154"           "CSSHIFTJIS"          "CSVISCII"           
 [81] "CYRILLIC"            "CYRILLIC-ASIAN"      "GEORGIAN-ACADEMY"    "GEORGIAN-PS"        
 [85] "HP-ROMAN8"           "HZ"                  "HZ-GB-2312"          "IBM-1163"           
 [89] "IBM-CP1133"          "IBM1163"             "IBM437"              "IBM775"             
 [93] "IBM819"              "IBM850"              "IBM852"              "IBM855"             
 [97] "IBM857"              "IBM860"              "IBM861"              "IBM862"             
[101] "IBM863"              "IBM864"              "IBM865"              "IBM866"             
[105] "IBM869"              "ISO_8859-1"          "ISO_8859-1:1987"     "ISO_8859-10"        
[109] "ISO_8859-10:1992"    "ISO_8859-13"         "ISO_8859-14"         "ISO_8859-14:1998"   
[113] "ISO_8859-15"         "ISO_8859-15:1998"    "ISO_8859-16"         "ISO_8859-16:2001"   
[117] "ISO_8859-2"          "ISO_8859-2:1987"     "ISO_8859-3"          "ISO_8859-3:1988"    
[121] "ISO_8859-4"          "ISO_8859-4:1988"     "ISO_8859-5"          "ISO_8859-5:1988"    
[125] "ISO_8859-9"          "ISO_8859-9:1989"     "ISO-8859-1"          "ISO-8859-10"        
[129] "ISO-8859-13"         "ISO-8859-14"         "ISO-8859-15"         "ISO-8859-16"        
[133] "ISO-8859-2"          "ISO-8859-3"          "ISO-8859-4"          "ISO-8859-5"         
[137] "ISO-8859-9"          "ISO-CELTIC"          "ISO-IR-100"          "ISO-IR-101"         
[141] "ISO-IR-109"          "ISO-IR-110"          "ISO-IR-144"          "ISO-IR-148"         
[145] "ISO-IR-157"          "ISO-IR-179"          "ISO-IR-199"          "ISO-IR-203"         
[149] "ISO-IR-226"          "ISO8859-1"           "ISO8859-10"          "ISO8859-13"         
[153] "ISO8859-14"          "ISO8859-15"          "ISO8859-16"          "ISO8859-2"          
[157] "ISO8859-3"           "ISO8859-4"           "ISO8859-5"           "ISO8859-9"          
[161] "JAVA"                "KOI8-R"              "KOI8-RU"             "KOI8-T"             
[165] "KOI8-U"              "L1"                  "L10"                 "L2"                 
[169] "L3"                  "L4"                  "L5"                  "L6"                 
[173] "L7"                  "L8"                  "LATIN-9"             "LATIN1"             
[177] "LATIN10"             "LATIN2"              "LATIN3"              "LATIN4"             
[181] "LATIN5"              "LATIN6"              "LATIN7"              "LATIN8"             
[185] "MAC"                 "MACCENTRALEUROPE"    "MACCROATIAN"         "MACCYRILLIC"        
[189] "MACGREEK"            "MACHEBREW"           "MACICELAND"          "MACINTOSH"          
[193] "MACROMAN"            "MACROMANIA"          "MACTHAI"             "MACTURKISH"         
[197] "MACUKRAINE"          "MS_KANJI"            "MS-ANSI"             "MS-ARAB"            
[201] "MS-CYRL"             "MS-EE"               "MS-TURK"             "MULELAO-1"          
[205] "NEXTSTEP"            "PT154"               "PTCP154"             "R8"                 
[209] "RISCOS-LATIN1"       "ROMAN8"              "SHIFT_JIS"           "SHIFT_JISX0213"     
[213] "SHIFT-JIS"           "SJIS"                "TCVN"                "TCVN-5712"          
[217] "TCVN5712-1"          "TCVN5712-1:1993"     "VISCII"              "VISCII1.1-1"        
[221] "WINBALTRIM"          "WINDOWS-1250"        "WINDOWS-1251"        "WINDOWS-1252"       
[225] "WINDOWS-1254"        "WINDOWS-1256"        "WINDOWS-1257"        "WINDOWS-1258" 
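
The trial-and-error above can be sketched as a loop, here in Python and not as the code actually used for the list. The codec names are Python's, not the iconv names listed above; `matching_encodings` is a hypothetical helper; and the sample bytes are synthetic, not taken from radiocarbon.csv.

```python
def matching_encodings(raw: bytes, expected: str, candidates: list) -> list:
    """Return the candidate encodings under which `raw` decodes to text
    containing `expected`."""
    hits = []
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # Bytes are invalid in this encoding, or the codec is unknown.
            continue
        if expected in text:
            hits.append(enc)
    return hits


candidates = ["utf-8", "cp1252", "iso-8859-1", "mac_roman", "cp437", "cp850"]

# Synthetic test bytes: a typographic apostrophe written as Mac Roman.
raw = "dell\u2019Orso".encode("mac_roman")

hits = matching_encodings(raw, "dell\u2019Orso", candidates)
print(hits)
```

A check like this only helps when the file went through a single wrong encoding step. If the text was re-encoded more than once, no single encoding will match.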

@nevrome
Member

nevrome commented Sep 10, 2021

You're my man, Martin! Impressive dedication! Let's ask the creator of this database then.

Hey, @apalmisano82, sorry for summoning you once again to this repository. We have some trouble with your dataset "Regional Demographic Trends and Settlement Patterns in Central Italy: Archaeological Sites and Radiocarbon Dates". So far we assumed this data to be UTF-8 encoded, but this does not seem to be right. We're getting a lot of broken symbols, especially in the literature column. Martin now tried a ton of other possible encodings, but none of them match.

  1. Do you remember which encoding you used or do you have another explanation for this issue?
  2. I checked if all of this data is already in AIDA, so that we could fall back on that. But this also does not seem to be the case. This old dataset seems to have some dates not in AIDA yet (or at least not with the same lab numbers... 🤔). Is this on purpose?

As always: Thanks for your help!

@nevrome added the bug label on Apr 10, 2022