User talk:DPLA bot
strange dupe-tagging
Hi DPLA_bot, today you (or rather your bot) speedy-tagged a number of files as duplicates which, judging by their content, are not duplicates at all.
- File:Beckman pH Meter (Laboratory Model) brochure - DPLA - 020c022d23f3a2d8f2763cc62a0b1e52 (page 3).jpg
- File:The Mitchell Mining Company - DPLA - 025fddd447a1da7a2ad217b094e9adeb (page 58).jpg
- File:The Mitchell Mining Company - DPLA - 025fddd447a1da7a2ad217b094e9adeb (page 65).jpg
- File:The Mitchell Mining Company - DPLA - 025fddd447a1da7a2ad217b094e9adeb (page 83).jpg
- File:Helinews - DPLA - 02923a55b329403c0cdf3360634dff83 (page 5).jpg
- File:The Mitchell Mining Company - DPLA - 025fddd447a1da7a2ad217b094e9adeb (page 43).jpg
- File:The Mitchell Mining Company - DPLA - 025fddd447a1da7a2ad217b094e9adeb (page 48).jpg
- File:Gordon Research Conference on Radiation Chemistry, 1973 - DPLA - 008e26b3830e00d43df0c74e766df4f4 (page 2).jpg
- File:Lacquer's Like That - DPLA - 02a0d5c450cdf76f1b243add354b3b09 (page 2).jpg
- File:Bendix Gas Chromatograph and Mass Spectrometer - DPLA - 038dcd0c795fb0bb2161854148a20041 (page 3).jpg
- File:Advertisements for publications from Akademische Verlagsgesellschaft - DPLA - 036fb7d1843d427aa8f5ecd1bc3598a6 (page 56).jpg
- File:Ion source for Bendix Mass Spectrometer - DPLA - 053f7e0503c679fb5f02ac26f395a5fa (page 3).jpg
- File:Separator at Hercules Parlin plant - DPLA - 051dd50822bc7ce0081a1e0c49bc4255 (page 2).jpg
- File:Swanson TV Dinner Swiss Steak box - DPLA - 033da20203975e46c6b8f595f3419a71 (page 7).jpg
- File:Dairy Science Instruments - DPLA - 053fd7bf8c6961d027cb19e9a8388472 (page 3).jpg
- File:Dairy Science Instruments - DPLA - 053fd7bf8c6961d027cb19e9a8388472 (page 4).jpg
- File:Dairy Science Instruments - DPLA - 053fd7bf8c6961d027cb19e9a8388472 (page 5).jpg
- File:Instructions, Beckman-SDS Hybrid Fortran II - DPLA - 0592901dd0c77f46b2e8eeda32009223 (page 3).jpg
- File:Il Colombo - Regioni Esterne del Corpo, Scheletro, Sistema Vasale, Muscoli, Organi Interni - DPLA - 07c5a923ceba5a90c34d6b351117c176 (page 14).jpg
- File:Portrait of Wilhelm Salomon-Calvi - DPLA - 066abe16765f0b75365872d56056f617 (page 2).jpg
- File:The Hercules Mixer Volume 3, Number 9 - DPLA - 07b54350a77562d69036352fd07bc48b (page 25).jpg
- File:Die moderne Chemie - Eine Schilderung der chemischen Grossindustrie - DPLA - 060d86b33887f302506c9cabf1002d78 (page 5).jpg
- File:Die moderne Chemie - Eine Schilderung der chemischen Grossindustrie - DPLA - 060d86b33887f302506c9cabf1002d78 (page 2).jpg
- File:Instruction Manual, DK-2 Spectrophotometer - DPLA - 085f3cbd51ae2b55dd1cdbe300a1875c (page 31).jpg
- File:Instruction Manual, DK-2 Spectrophotometer - DPLA - 085f3cbd51ae2b55dd1cdbe300a1875c (page 32).jpg
- File:Silicone CoverUps - DPLA - 08b76c70f764ae074bd3f097ef17305f (page 6).jpg
- File:Il Colombo - Regioni Esterne del Corpo, Scheletro, Sistema Vasale, Muscoli, Organi Interni - DPLA - 07c5a923ceba5a90c34d6b351117c176 (page 20).jpg
- File:Il Colombo - Regioni Esterne del Corpo, Scheletro, Sistema Vasale, Muscoli, Organi Interni - DPLA - 07c5a923ceba5a90c34d6b351117c176 (page 22).jpg
- File:Tangee Face Powder - DPLA - 08da22f42f776a4e1ac3b9da29f5097e (page 3).jpg
- File:Box of See-Safe polyethylene plastic wrap - DPLA - 086f77c7f546a3fc30e9651596dfc46d (page 3).jpg
- File:Instruction Manual, DK-2 Spectrophotometer - DPLA - 085f3cbd51ae2b55dd1cdbe300a1875c (page 58).jpg
- File:Letters from Max Bredig to H. Jermain Creighton, December 6, 1938 - DPLA - 0b0c082feea6953d8c35f7a1538d7f44 (page 7).jpg
- File:Popular Zoology - DPLA - 0913ee55ee6a121a087545133544241c (page 276).jpg
- File:Wolldruck - DPLA - 00ea7ccb276819c95efde0aacffd15b9 (page 47).jpg
and many more. Is this really intentional? --Túrelio (talk) 07:59, 20 May 2026 (UTC)
- @Túrelio: That bot is only doing shit at the moment: it moved files like File:Ronald Reagan and Douglas Ginsburg.jpg, which was uploaded by a user, changed the file description, and deleted all categories. Adding or updating the file description might be acceptable, but renaming the file is rather questionable, as is deleting categories: totally unnecessary, yes, even vandalism. זיו「Ziv」 • For love letters and other notes 17:11, 20 May 2026 (UTC)
- Addendum: The license is now also incorrect. זיו「Ziv」 • For love letters and other notes 17:43, 20 May 2026 (UTC)
- Yes, I’ve been working on cleaning this up as soon as I realized. I appreciate your patience in allowing me to resolve it. I undid all the duplicate tags that shouldn’t have gone out, and will sort out the rest, too. I understand this is frustrating, but it’s certainly not ill-intentioned. The bot runs at high volume. Dominic (talk) 17:49, 20 May 2026 (UTC)
- Thank you very much for your answer @Dominic. I don't know if there are any other files like the one I mentioned above; I noticed just this one. If so, please correct them, thank you. Best regards, זיו「Ziv」 • For love letters and other notes 18:04, 20 May 2026 (UTC)
- Also, I didn't mean to be short above, just trying to work quickly. I want to give a fuller explanation later, but I can tell already that what happened was I had some test code I was working on that then ran overnight before I realized it was not the regular code. My first priority is just rolling back any unintended edits. Dominic (talk) 18:09, 20 May 2026 (UTC)
- I agree about the renaming, why does the file name need to have the DPLA id in it? It just clutters the file name and makes it extra lengthy for no benefit. Traumnovelle (talk) 01:15, 27 May 2026 (UTC)
- @Traumnovelle: Thanks for asking! There are a few different benefits. One is that the institutions who provide these files are maintaining their metadata over time, and sometimes corrections or changes are made. The goal of our project is to synchronize that dataset with the uploads on Commons so that they are also maintained here. Standardizing the image titles and metadata allows them to be maintained by bot. The only images the bot will touch are ones that are exact hash matches for the file from the institution's own catalog, meaning someone uploaded the file to Commons from the institution exactly as is. Sometimes they are uploaded here either with a title copied from Flickr, which wasn't really selected by the user anyway, and comes from the institution ultimately, or it might be a user-generated title which is sometimes not descriptive, and not trackable. I know what you mean about adding length to some file names, but it's not useless or for vanity, it's for a real purpose! Dominic (talk) 04:46, 27 May 2026 (UTC)
- @Traumnovelle: FWIW, working in categories with a lot of archival materials, I find it useful to know at a glance which ones came into Commons via DPLA bot. Sets my expectations very clearly for what will be the strengths and weaknesses of the metadata (in the broad sense of the latter). - Jmabel ! talk 14:31, 27 May 2026 (UTC)
- File names such as 'Wolldruck - DPLA' would convey the same information in that regard. Traumnovelle (talk) 20:14, 27 May 2026 (UTC)
- There is more to it than that, though. There are over 10 million files from DPLA, so unique names are essential, and also the IDs are the only way we could associate them with the source metadata and maintain them. Dominic (talk) 03:53, 28 May 2026 (UTC)
- @Dominic i think you should make use of com:sdc instead of using the filename which is pretty damn useless and unimportant.
- if necessary, make a new property like d:Wikidata:Property proposal/Flickr Photo ID did.
- take a look at how flickr backfilled their metadata User:FlickypediaBackfillrBot.
- https://commons.wikimedia.org/w/index.php?diff=1222566342 you removed some commons users' work, which is not from the institutions you are working for. that should be avoided. RoyZuo (talk) 09:08, 30 May 2026 (UTC)
- there's actually already DPLA ID (P760). you should write that to sdc instead of doing anything with filename which is a maintenance burden and not stable. RoyZuo (talk) 09:13, 30 May 2026 (UTC)
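As a hedged illustration of the naming scheme discussed above: the file names listed on this page follow the visible pattern `<title> - DPLA - <32-character hex ID> (page N).<ext>`, with the page part optional. The sketch below recovers the ID from such a name, so that it could be matched against source metadata or stored as a structured data statement as suggested in the thread. The function name `parse_dpla_filename` and the regular expression are assumptions made for illustration, not the bot's actual code.

```python
import re
from typing import Optional

# Assumed pattern, inferred only from the file names quoted on this page.
DPLA_NAME_RE = re.compile(
    r"^(?:File:)?(?P<title>.+) - DPLA - (?P<dpla_id>[0-9a-f]{32})"
    r"(?: \(page (?P<page>\d+)\))?\.(?P<ext>\w+)$"
)


def parse_dpla_filename(name: str) -> Optional[dict]:
    """Split a DPLA-style file name into title, ID, page and extension.

    Returns None when the name does not follow the assumed pattern.
    """
    match = DPLA_NAME_RE.match(name)
    if match is None:
        return None
    page = match.group("page")
    return {
        "title": match.group("title"),
        "dpla_id": match.group("dpla_id"),
        "page": int(page) if page is not None else None,
        "ext": match.group("ext"),
    }


if __name__ == "__main__":
    example = "File:Wolldruck - DPLA - 00ea7ccb276819c95efde0aacffd15b9 (page 47).jpg"
    print(parse_dpla_filename(example))
```

A name without the ID, such as 'Wolldruck - DPLA', returns `None` here, which is the practical difference between the two naming proposals: only the longer form can be traced back to a single catalog record.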
There are a couple of things going on here to explain:
- The {{Duplicate}} tagging wasn't working as intended, and is, of course, only intended for non-controversial exact duplicate cleanup. I reverted them all, but sometimes images do get uploaded twice when a title or ID changes, so I will go back and make sure no actual duplicates are left in place.
- Sometimes we try to upload from a partner and find out that a Wikimedian already uploaded it. Or that we uploaded it before under a different name. In these cases, we want to maintain the metadata and title, so there is proper information and attribution. It might have been uploaded 10 years ago with minimal description. What I was testing was how we could do that cleanly. Overwriting valuable info like categories and copyright templates was not intentional, and is a new insight for when we do implement anything like this.
- This was all test code not intended to be run without monitoring. Some of it was even unfinished. I wouldn't normally test like that, and never have. Please let me know if you catch anything I didn't mention above. Dominic (talk) 18:31, 20 May 2026 (UTC)
- Sorry, but this is still unacceptable, for example File:David Addington, Lucy Tutwiler, and Katie Wilson Wearing Body Armor at Hakim Compound in Red Zone, Baghdad - DPLA - e259228a1ef65e5fe7223a6f841c2e39.jpg. The old file name was good enough. The DPLA number can be included in the file description, but it doesn't have to be in the filename. Please stop moving correctly named files. זיו「Ziv」 • For love letters and other notes 20:21, 22 May 2026 (UTC)
- Addendum: Overwriting valid licenses is also a problem here. Originally, this was a Flickr upload with the license {{Flickr-no known copyright restrictions}}, now we have {{PD-US}}, which states that the image is not in the public domain in some countries, which was previously the case. The correct license should be {{PD-USGov}}. The bot should check which files it has inserted an incorrect license into and adjust them accordingly. זיו「Ziv」 • For love letters and other notes 21:54, 22 May 2026 (UTC)
- I hear why you’re frustrated, so I just want to clarify a few things. The bot is operating under the direction of the National Archives itself. Whatever they previously uploaded to Flickr is not as authoritative as the current metadata and identifier. We are only touching files that are exact hash matches for the file in the current NARA catalog. When one is detected, I am renaming it to provide the accurate current title, metadata, and identifier, and link to the new catalog. Most of these are many years old. I am doing my best to do it the right way, such as leaving redirects in place and using the User:CommonsDelinker process so no links are broken. If there is a PD-USGov template, it is retained. The Flickr tag is not necessary or helpful for a file which is directly copied from the official catalog, and the Flickr-no known copyright restrictions template doesn’t really add anything, certainly nothing worth retaining over the exact copyright statement from the institution. Dominic (talk) 22:22, 22 May 2026 (UTC)
- Okay thank you. זיו「Ziv」 • For love letters and other notes 05:39, 23 May 2026 (UTC)
- Hello @Dominic:
- For your information: I added the Category:Files exempt from duplicate tagging to both files File:(Sherman case docket) - DPLA - 2e9d3ca5d4c772d9e3da903fe4812265 (page 1).jpg and File:(Sherman case docket) - DPLA - 2e9d3ca5d4c772d9e3da903fe4812265 (page 12).jpg. Otherwise, OptimusPrimeBot would recognize them as duplicates again and tag them accordingly. This prevents them from being deleted again. Best regards, זיו「Ziv」 • For love letters and other notes 16:04, 24 May 2026 (UTC)
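The two mechanisms described in this thread, touching only exact hash matches and not discarding existing categories when a description is rewritten, can be sketched roughly as follows. This is a minimal sketch under stated assumptions: SHA-1 is assumed as the hash, and the helper names `is_exact_match` and `carry_over_categories` are hypothetical and do not come from the bot. License templates would need similar, more careful handling that is not shown here.

```python
import hashlib
import re

# Matches category links such as [[Category:Ronald Reagan]] in wikitext.
CATEGORY_RE = re.compile(r"\[\[\s*Category\s*:[^\]]+\]\]", re.IGNORECASE)


def sha1_hex(data: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of a file's raw bytes."""
    return hashlib.sha1(data).hexdigest()


def is_exact_match(commons_bytes: bytes, catalog_bytes: bytes) -> bool:
    """True only when both files are byte-for-byte identical (same digest)."""
    return sha1_hex(commons_bytes) == sha1_hex(catalog_bytes)


def carry_over_categories(old_wikitext: str, new_wikitext: str) -> str:
    """Append categories found in the old page text but missing from the new one."""
    existing = CATEGORY_RE.findall(new_wikitext)
    # dict.fromkeys removes repeats while keeping the original order.
    old_categories = dict.fromkeys(CATEGORY_RE.findall(old_wikitext))
    missing = [cat for cat in old_categories if cat not in existing]
    if not missing:
        return new_wikitext
    return new_wikitext.rstrip("\n") + "\n" + "\n".join(missing) + "\n"


if __name__ == "__main__":
    old = "{{Information}}\n[[Category:Ronald Reagan]]\n"
    new = "{{Artwork}}\n"
    print(carry_over_categories(old, new))
```

Comparing digests rather than titles is what restricts edits to files copied unchanged from the catalog; a cropped or re-encoded copy produces a different digest and would be left alone.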
File:"Bear poster" (Disney) - DPLA - 8b59300ed1769737b7eac277ce9fc5fa.gif has been nominated for deletion at
This is a deletion request for the community to discuss whether the nominated page should be kept or deleted. Please voice your opinion in the linked request above. Thank you very much! If you created this file, please note that the fact that it has been proposed for deletion does not necessarily mean that we do not value your kind contribution. It simply means that one person believes that there is some specific problem with it, such as a copyright issue. Please see Commons:But it's my own work! for a guide on how to address these issues.
(Oinkers42) (talk) 20:38, 29 May 2026 (UTC)
Index card uploads
The bot is currently uploading individual index cards from the NARA series "Universal Newsreels' Subject Card Catalog", which consists of over half a million images (514,742 to be exact), with over 120k uploaded so far. [1] This series consists of 278 file units, and each of those file units is also available as a PDF file, e.g. [2]. Why are we not uploading those 278 PDF files, which contain everything, instead of half a million individual JPG files? The PDF files are also OCR'd, so easily searchable using Special:Search, and of course far easier to maintain on a platform like Commons.
Note that there are much, much larger collections of index cards in the National Archives, e.g. this, where we can choose between uploading more than 16,000,000 images or 5,000 PDF files. ~TheImaCow (talk) 14:42, 30 May 2026 (UTC)