SigLIP 2: A better multilingual vision language encoder

browallia (Mar 25, 2025): cross-modal similarity

When I input an image and a text to the model:

```python
# `model` and `processor` are assumed to be an already-loaded SigLIP 2
# checkpoint (e.g. via AutoModel.from_pretrained / AutoProcessor.from_pretrained)
import torch
from transformers.image_utils import load_image

def get_output(url, text):
    # note: both arguments are shadowed by the hard-coded values below
    text = "a photograph of two cats"
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = load_image(url)
    inputs = processor(text=[text], images=image, padding="max_length", return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model(**inputs)
    return output
```

The output logits are:

SiglipOutput(loss=None, logits_per_image=tensor([[-15.5217]], device='cuda:0'), logits_per_text=tensor([[-15.5217]], device='cuda:0'))

After sigmoid the output is ~0.
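That figure checks out numerically: a logit of -15.52 maps to a probability on the order of 1e-7, so any sizeable negative logit will round to zero after the sigmoid. A quick sanity check:

```python
import math

def sigmoid(x: float) -> float:
    # logistic function: maps a logit to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

p = sigmoid(-15.5217)  # the logit reported above
print(p)  # ~1.8e-07, i.e. effectively zero
```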

How can I fix it?

potatowarriors (Apr 3, 2025), in reply:

I’m having that problem too.

praveenNathan (Mar 31, 2025):

Are there any examples for fine-tuning on a custom dataset?

ariG23498 (Apr 8, 2025), in reply:

@nielsr has a great notebook on fine-tuning SigLIP and similar models.

https://github.com/NielsRogge/Transformers-Tutorials/tree/master/SigLIP

malba96 (Jun 9, 2025), in reply:

I believe that example may be very specific to multi-label classification. @praveenNathan, if you want to continue the pre-training and use it for zero-shot classification (or build better vector representations for your specific domain), you can check the open_clip implementation at https://github.com/mlfoundations/open_clip; they have the losses there for v1.
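For reference, the pairwise sigmoid loss used for SigLIP-style training can be sketched as below. The temperature `t` and bias `b` are learned parameters in real training; the values here are illustrative defaults, not the learned ones.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-5.0):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings.

    Every image-text pair is scored independently: diagonal pairs are
    positives (label +1), all off-diagonal pairs negatives (label -1).
    t (temperature) and b (bias) are illustrative, not learned values.
    """
    logits = t * img_emb @ txt_emb.T + b      # (N, N) similarity logits
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0            # +1 on diagonal, -1 elsewhere
    # -log sigmoid(z * logit), written stably as log1p(exp(-z * logit))
    loss = np.log1p(np.exp(-labels * logits))
    return loss.sum() / n

# perfectly aligned toy embeddings -> near-zero loss
img = np.eye(2)
txt = np.eye(2)
print(siglip_loss(img, txt))  # small positive value (~0.013)
```

Unlike a softmax contrastive loss, nothing here normalizes across the batch, which is also why individual sigmoid scores at inference time are not calibrated to sum to one over candidate labels.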

P4ddyki (Sep 11, 2025):

Hello everyone, am I wrong, or is zero-shot classification not really working? You can take a look at the provided example: we replaced the original image with another one and added the text "golf" to the text corpus. With other examples and different backbones as well, it looks like zero-shot is not really working.


(try it here: https://colab.research.google.com/github/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/siglip-zero-shot-image-classification/siglip-zero-shot-image-classification.ipynb)


Also, the code provided here: https://huggingface.co/docs/transformers/main/en/model_doc/siglip2 never works, at least not when using sigmoid.


Any recommendations?


"Screenshot
