{"id":627,"date":"2026-05-03T19:59:15","date_gmt":"2026-05-03T16:59:15","guid":{"rendered":"https:\/\/m4.ist\/index.php\/2026\/05\/03\/ollama-vllm-llamacpp-coklu-ollama-vllm-llamacpp-tek\/"},"modified":"2026-05-03T19:59:15","modified_gmt":"2026-05-03T16:59:15","slug":"ollama-vllm-llamacpp-coklu-ollama-vllm-llamacpp-tek","status":"publish","type":"post","link":"https:\/\/m4.ist\/index.php\/2026\/05\/03\/ollama-vllm-llamacpp-coklu-ollama-vllm-llamacpp-tek\/","title":{"rendered":"Ollama, vLLM, llama.cpp: 2026 Guide"},"content":{"rendered":"<h1>Ollama, vLLM, llama.cpp: A Realistic Multi-GPU Analysis for the Homelab Operator<\/h1>\n<p>This guide focuses on Ollama, vLLM, and llama.cpp. Contents:<\/p>\n<ul>\n<li><a href=\"#section-1\">Ollama, vLLM, llama.cpp: Multi-GPU Reality Check: Bandwidth Bottlenecks and Energy Costs<\/a><\/li>\n<li><a href=\"#section-2\">The PCIe Bandwidth Bottleneck<\/a><\/li>\n<li><a href=\"#section-3\">The Power and Heat Equation<\/a><\/li>\n<li><a href=\"#section-4\">Why &#8220;More GPUs&#8221; Is Not Always &#8220;Faster&#8221;<\/a><\/li>\n<li><a href=\"#section-5\">The Real-World Cost of Instability<\/a><\/li>\n<li><a href=\"#section-6\">Ollama: A Practical Wrapper for Multi-Device Execution<\/a><\/li>\n<li><a href=\"#section-7\">How Ollama Splits Models<\/a><\/li>\n<li><a href=\"#section-8\">Ease of Deployment and Configuration<\/a><\/li>\n<li><a href=\"#section-9\">Concurrency Limitations<\/a><\/li>\n<li><a href=\"#section-10\">Operator Notes: The &#8220;Ollama&#8221; Trade-off<\/a><\/li>\n
<li><a href=\"#section-11\">vLLM: The Throughput Beast for High-Concurrency Workloads<\/a><\/li>\n<li><a href=\"#section-12\">PagedAttention Mechanics<\/a><\/li>\n<li><a href=\"#section-13\">Distributed Serving Capabilities<\/a><\/li>\n<li><a href=\"#section-14\">High Concurrency vs. Low Latency<\/a><\/li>\n<li><a href=\"#section-15\">Operator Notes: The Complexity Tax<\/a><\/li>\n<li><a href=\"#section-16\">llama.cpp: The Raw Engine for Granular Control<\/a><\/li>\n<li><a href=\"#section-17\">The GGUF Format and Quantization<\/a><\/li>\n<li><a href=\"#section-18\">Manual Tensor Splitting and Layer Control<\/a><\/li>\n<li><a href=\"#section-19\">Low-Level Control and Flexibility<\/a><\/li>\n<li><a href=\"#section-20\">Operator Notes: The Maintenance Burden<\/a><\/li>\n<li><a href=\"#section-21\">Hardware Topology Matters: NVLink vs. PCIe Bottlenecks<\/a><\/li>\n<li><a href=\"#section-22\">The NVLink Advantage (and Its Absence)<\/a><\/li>\n<li><a href=\"#section-23\">PCIe Gen3 vs. Gen4 vs. Gen5<\/a><\/li>\n<li><a href=\"#section-24\">The Impact of Latency on Context Switching<\/a><\/li>\n
<li><a href=\"#section-25\">Comparison: Data Center vs. Consumer<\/a><\/li>\n<li><a href=\"#section-26\">Memory Hierarchy: VRAM, RAM, and Swap Risks<\/a><\/li>\n<li><a href=\"#section-27\">The VRAM Hierarchy<\/a><\/li>\n<li><a href=\"#section-28\">CPU Offloading and RAM Swapping<\/a><\/li>\n<li><a href=\"#section-29\">The Risk of Swap Files<\/a><\/li>\n<li><a href=\"#section-30\">Operator Notes: The &#8220;Swap&#8221; Panic<\/a><\/li>\n<li><a href=\"#section-31\">Pre-Deployment Hardware Audit: A Checklist for Stability<\/a><\/li>\n<li><a href=\"#section-32\">Verifying a Multi-GPU Configuration: A Step-by-Step Guide<\/a><\/li>\n<li><a href=\"#section-33\">Inference Engine Feature Matrix<\/a><\/li>\n<li><a href=\"#section-34\">Multi-GPU Performance Expectations by Topology<\/a><\/li>\n<li><a href=\"#section-35\">Hardware Failure Mode Analysis<\/a><\/li>\n<li><a href=\"#section-36\">Troubleshooting: What Happens When Multi-GPU Fails?<\/a><\/li>\n<li><a href=\"#section-37\">Problem 1: &#8220;CUDA out of memory&#8221; on GPU 0<\/a><\/li>\n<li><a href=\"#section-38\">Problem 2: Performance is slower than a single GPU<\/a><\/li>\n<li><a href=\"#section-39\">Problem 3: One GPU is idle, the other is full<\/a><\/li>\n<li><a href=\"#section-40\">Problem 4: The system crashes after a few minutes<\/a><\/li>\n<li><a href=\"#section-41\">Practical Scenario: A 70B Model on Two RTX 3090s<\/a><\/li>\n<li><a href=\"#section-42\">Frequently Asked Questions<\/a><\/li>\n<li><a href=\"#section-43\">Conclusion: Choosing the Right Tool for the Job<\/a><\/li>\n<\/ul>\n<p>If you are standing at your workbench with two or three consumer-grade GPUs and a handful of bills, you may believe that stacking hardware will deliver instant, magical results. 
It will not. The reality of running <strong>Ollama, vLLM, or llama.cpp<\/strong> on consumer hardware is a game of diminishing returns, thermal throttling, and bandwidth bottlenecks that drain your wallet faster than the system generates tokens. Before you spend another cent on power distribution units or cooling upgrades, you need to understand that the PCIe bus is a bottleneck, not a highway.<\/p>\n<p>Adding a second GPU to your rig is not a linear upgrade. It is a complex integration task that introduces new failure modes. In data centers, NVLink closes this gap; in a homelab, PCIe Gen3 or Gen4 lanes force data to be shuffled back and forth constantly, which can turn the second card into a liability rather than an asset. You might be trying to run a 70-billion-parameter model on two 12GB cards, or spreading a model across the 48GB of VRAM on a pair of RTX 3090s to squeeze in a huge context window. The physics of these constraints determines which engine you should run.<\/p>\n<p>This guide cuts through the hype of &#8220;cloud-native&#8221; benchmarks. We look at the electricity bill, the fan noise in your living room, and the stability of a system running 24\/7 on a home internet connection. 
Whether you choose the simplicity of Ollama, the raw throughput of vLLM, or the granular control of llama.cpp, your choice has to match your physical hardware topology.<\/p>\n<p>We will detail how the engines split layers, where they choke on memory bandwidth, and what happens when the fans are spinning at 100% and the system starts swapping to RAM. This is not a theoretical overview; it is a survival guide for the operator who pays the electricity bill.<\/p>\n<h2>Ollama, vLLM, llama.cpp: Multi-GPU Reality Check: Bandwidth Bottlenecks and Energy Costs<\/h2>\n<p>When you buy a second graphics card, you are not just buying memory; you are also buying heat, power draw, and a considerably more complex system architecture. The most common mistake homelab admins make is assuming that two GPUs running the same model will deliver twice the performance. In the world of large language model (LLM) inference, that assumption almost never holds outside very specific conditions involving NVLink and very large batch sizes.<\/p>\n<h3>The PCIe Bandwidth Bottleneck<\/h3>\n<p>The fundamental constraint in consumer multi-GPU setups is the Peripheral Component Interconnect Express (PCIe) bus. 
Modern consumer motherboards typically offer PCIe Gen4 x16 slots, but even these fall short of the data movement that large-model inference can demand. When a model is split across two GPUs, the layers have to communicate constantly. During the forward pass, the output of GPU 0 becomes the input of GPU 1. Inference has no backward pass, but once the last layer finishes, the result has to travel back so that the next token can start again at GPU 0, and prompt processing (context building) pushes much larger batches of activations along the same path.<\/p>\n<p>Suppose you have two RTX 3090s, each with 24GB of VRAM, and you load a 70B-parameter model. The model does not fit on a single card, so the inference engine has to split the layers. GPU 0 handles layers 0-39 and GPU 1 handles layers 40-79. Each time the model generates a token, GPU 0 finishes its computation and must send the result to GPU 1 over the PCIe bus. If your PCIe bandwidth is saturated, GPU 1 sits idle waiting for data. This is called a &#8220;PCIe stall&#8221;.<\/p>\n<p>In a single-GPU run, data never leaves VRAM. In a multi-GPU PCIe run, data crosses the motherboard repeatedly for every generated token. For a 70B model, that data movement adds up. With PCIe Gen3 x16, theoretical bandwidth is roughly 16 GB\/s per direction. Real-world throughput is usually lower because of protocol overhead. 
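<\/p>\n<p>Before estimating how much the bus costs you, it helps to check which link each card has actually negotiated. The commands below are a minimal sketch, assuming NVIDIA cards with the standard <code>nvidia-smi<\/code> tool installed; the output differs from system to system.<\/p>\n<pre><code class=\"language-bash\"># Current vs. maximum PCIe generation and lane width for each GPU\nnvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv\n\n# Connection matrix: shows whether GPU pairs talk over NVLink, a PCIe switch, or the CPU\nnvidia-smi topo -m\n<\/code><\/pre>\n<p>A card that reports a width of 8 or 4 where you expected 16 is sharing lanes with another slot. Many cards also drop to a lower link generation when idle, so read these values while a model is running.<\/p>\n<p>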
If your inference engine needs 20 GB\/s of inter-GPU traffic, you are bottlenecked before compute even starts. This means that adding a second GPU to a standard consumer motherboard may yield only a 1.3x to 1.5x speedup instead of the 2.0x you hoped for.<\/p>\n<h3>The Power and Heat Equation<\/h3>\n<p>You are probably doing this because you want to run larger models locally without paying cloud fees. But you have to account for the cost of power. Under full load, a modern GPU can draw between 250W and 350W. Running two of them adds 500W to 700W to your household load. Over 24 hours, that is an extra 12 to 16.8 kWh. Depending on your local electricity rates, this is a significant monthly expense.<\/p>\n<p>If your electricity rate is low, say $0.10 per kWh, running a dual-GPU setup at full load around the clock costs roughly $36 to $50 extra per month (12 to 16.8 kWh per day, times 30 days, times $0.10). If you live somewhere with higher rates, such as California or parts of Europe, that figure can easily double. And power is only half of the equation; the other half is heat.<\/p>\n<p>In a cramped chassis (a 1U or 2U server case, or even a modified ATX tower), two GPUs create a heat island. The air inside the case heats up very quickly. 
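<\/p>\n<p>To see how quickly heat builds up in your own case, you can log temperature, power draw, fan speed, and clocks while a model is generating. This is a minimal sketch, assuming NVIDIA cards and the standard <code>nvidia-smi<\/code> tool; the 5-second interval is an arbitrary choice.<\/p>\n<pre><code class=\"language-bash\"># Print one CSV line per GPU every 5 seconds; stop with Ctrl+C\nnvidia-smi --query-gpu=timestamp,index,temperature.gpu,power.draw,fan.speed,clocks.sm --format=csv -l 5\n<\/code><\/pre>\n<p>Watch how the temperature and SM clock columns move together under sustained load, and compare the power column against the per-card figures used in the cost estimate.<\/p>\n<p>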
Most consumer graphics cards have a maximum temperature limit between 83\u00b0C and 90\u00b0C. When they reach that threshold, they throttle. Throttling lowers clock speeds, so every token takes longer to generate and the cards stay under load <em>longer<\/em> for the same amount of work. You end up with a system that is hot, loud, and slow all at once, and the only way out is better airflow or a lower power target.<\/p>\n<p>You may think you can solve this with more fans, but you quickly run into noise limits. Two 3090s at 100% fan speed sound like an aircraft taking off. If you cannot tolerate that noise at home or in the office, you will have to underclock or undervolt the cards. Undervolting requires manual tuning and can cause instability if done wrong.<\/p>\n<h3>Why &#8220;More GPUs&#8221; Is Not Always &#8220;Faster&#8221;<\/h3>\n<p>&#8220;More GPUs means faster&#8221; is a line sold by marketing departments. Reality is more nuanced. Adding a GPU increases capacity (the size of model you can run), but because of the overheads described above, it usually lowers efficiency (the tokens per second you get per card and per watt).<\/p>\n<p>Consider a specific scenario. 
You have a 13B-parameter model. Quantized to 4 bits, it fits comfortably on a single 12GB card. Running it on one card gives you high token throughput, low latency, and low power draw. If you instead try to run it on two 8GB cards by splitting the model, you add PCIe overhead, double the power draw, and increase heat. You will most likely get <em>slower<\/em> performance than on the single card. This is the &#8220;diseconomy of scale&#8221; at small batch sizes.<\/p>\n<p>Multi-GPU makes sense when the model is larger than the VRAM of a single consumer card. For example, a 70B model (roughly 140GB in FP16, or about 40GB as Q4_K_M) does not fit on a single 24GB card. You <em>have<\/em> to use two cards. In that specific case the trade-off is acceptable: you get to run the model at all, and you accept the latency penalty. If the model fits on a single card, do not split it. You would be paying for power and hardware complexity with nothing to show for it.<\/p>\n<h3>The Real-World Cost of Instability<\/h3>\n<p>A hidden cost of multi-GPU setups is operational instability. Consumer hardware and drivers are not tuned for the sustained, heavy load of LLM inference. 
When you run two cards, the VRAM load is spread out, but the system memory (RAM) needed to manage the split can spike. If the operating system&#8217;s out-of-memory (OOM) killer fires, your entire setup goes down.<\/p>\n<p>You also add points of failure. If one GPU driver crashes, or one card shuts down from overheating, the entire inference process stops, and the engine may not recover gracefully. On a single-GPU setup, a crash means a restart. On a multi-GPU setup, twice as many components can take the process down, and every in-flight request loses its context when that happens. For a homelab admin who cares about uptime as much as cost, this is a critical risk.<\/p>\n<p>The question to ask yourself: is the ability to run a 70B model worth the added complexity? If a quantized 34B model (Yi 34B or Code Llama 34B, for example) on a single 24GB card does the job, you get high performance, low latency, and stability. If you need the raw capability of a 70B model, you need two cards. But you have to accept the PCIe bottleneck.<\/p>\n<h2>Ollama: A Practical Wrapper for Multi-Device Execution<\/h2>\n<p>Ollama has positioned itself as the &#8220;it just works&#8221; solution for LLMs. For a homelab operator, that is its strongest selling point. 
It abstracts away Python environments, dependency management, and much of the complexity of Docker containers. Once you bring in multiple GPUs, however, the &#8220;just works&#8221; magic becomes more fragile. Ollama&#8217;s multi-GPU implementation is pragmatic, but it is constrained by its single-process wrapper architecture.<\/p>\n<h3>How Ollama Splits Models<\/h3>\n<p>Ollama uses a layer-wise splitting approach. When you request a model that does not fit on a single GPU, Ollama tries to distribute the neural network layers across the available devices. It does this automatically, although you can usually influence the behavior with environment variables. Roughly speaking, the engine estimates how many layers fit on each GPU, fills the first device, and places the remaining layers on the next one.<\/p>\n<p>On a typical setup with two RTX 3090s, Ollama might assign the first 40 layers of an 80-layer model to GPU 0 and the remaining layers to GPU 1. The critical point is that this split happens at model load time. Layers are not swapped in or out dynamically during inference; the assignment is fixed. Once the model is loaded, GPU 0 is responsible for the lower layers and GPU 1 for the upper layers.<\/p>\n<p>This static split is efficient for simple workloads but can create bottlenecks. If the compute load is unevenly distributed between the layers (which does happen), one GPU may finish its share before the other, but it still has to wait for the other to complete its part of the forward pass before the next token can be generated. 
This synchronization cost is inherent to the design.<\/p>\n<h3>Ease of Deployment and Configuration<\/h3>\n<p>The primary reason operators choose Ollama is ease of setup. You can start a server with a single command: <code>ollama serve<\/code>. In multi-GPU environments, a containerized setup is a common way to make sure the CUDA devices are mapped correctly.<\/p>\n<p>Here is an example of running Ollama in a Docker container with access to multiple GPUs. It requires the NVIDIA Container Toolkit.<\/p>\n<pre><code class=\"language-bash\">docker run -d --name ollama-multi-gpu --gpus all \\\n  -p 11434:11434 \\\n  -v ollama:\/root\/.ollama \\\n  ollama\/ollama:latest\n<\/code><\/pre>\n<p>The <code>--gpus all<\/code> flag is essential. It tells Docker to expose every available GPU to the container. Ollama then scans the available CUDA devices and assigns model layers accordingly. You can verify which GPUs are in use by inspecting the container logs or by running <code>nvidia-smi<\/code>.<\/p>\n<p>There are limitations, however. Ollama&#8217;s multi-GPU support is relatively new and not as mature as vLLM&#8217;s. It optimizes for latency at small batch sizes, not for high-throughput serving. If you run an API server for multiple users, Ollama can struggle with concurrent requests. 
It is excellent for single-user interaction or a small batch of requests, but it lacks the advanced request scheduling that production-grade engines offer.<\/p>\n<h3>Concurrency Limitations<\/h3>\n<p>Ollama is great as a locally running personal assistant, but it falls short when you need to serve many users at once. By default, Ollama&#8217;s inference engine processes requests sequentially or in small batches. When more requests arrive at the same time than it is configured to run in parallel (the <code>OLLAMA_NUM_PARALLEL<\/code> setting), Ollama queues them.<\/p>\n<p>On a multi-GPU setup, this queueing can become a bottleneck. Even with 48GB of VRAM, the system may not use both GPUs efficiently when the batch size is small. The PCIe bus becomes the choke point for moving data between layers, and Ollama&#8217;s single-process architecture cannot easily parallelize request processing across multiple GPUs without significant internal complexity.<\/p>\n<h3>Operator Notes: The &#8220;Ollama&#8221; Trade-off<\/h3>\n<p>If you choose Ollama, you are trading flexibility for convenience. You get a robust system that is easy to manage, but you give up the ability to fine-tune the layer-splitting logic or change the memory layout dynamically. If you need to run a model that is very sensitive to memory layout (some custom quantization formats, for example), Ollama may not expose the knobs you need.<\/p>\n<p>Also keep an eye on resource cost. 
Ollama runs a server process that stays resident in memory. If you work with several models, Ollama keeps each one loaded for a while after its last use, consuming both RAM and VRAM. Push the system too hard and you will hit out-of-memory (OOM) errors. You need to be disciplined about unloading models you are not using.<\/p>\n<h2>vLLM: The Throughput Beast for High-Concurrency Workloads<\/h2>\n<p>If Ollama is a practical wrapper, vLLM is a high-performance engine built for scale. Developed by researchers at UC Berkeley, vLLM was designed specifically to solve the memory bottleneck in LLM serving, using a technique called PagedAttention. For a homelab operator trying to build a multi-GPU inference server that handles several users, vLLM is usually the better choice, even though the learning curve is steeper and the resource overhead is higher.<\/p>\n<h3>PagedAttention Mechanics<\/h3>\n<p>Traditional LLM serving engines store each request&#8217;s attention key-value (KV) cache in one contiguous block of memory. The cache grows linearly with the context, so the block is typically reserved for the maximum sequence length up front, and much of it sits unused. The only way to save memory is to reduce the batch size or the context window, which is inefficient.<\/p>\n<p>vLLM introduces PagedAttention, which splits the KV cache into small blocks that do not have to be contiguous in memory. 
Much like an operating system manages pages of physical RAM, it uses a block table to map each request to its blocks. This lets vLLM do the following:<\/p>\n<ol>\n<li><strong>Share memory:<\/strong> When requests start with the same prefix, they can share the same memory blocks instead of storing duplicates.<\/li>\n<li><strong>Reduce fragmentation:<\/strong> Memory is allocated in fixed-size blocks, so waste is limited to the partially filled last block of each request.<\/li>\n<li><strong>Support larger batches:<\/strong> Because memory is used more efficiently, you can fit more concurrent requests on the same hardware.<\/li>\n<\/ol>\n<p>In a multi-GPU setup, this efficiency is critical. Splitting a model across two GPUs does not make the KV cache overhead go away. vLLM&#8217;s ability to manage that memory lets you run more requests before you hit the memory ceiling.<\/p>\n<h3>Distributed Serving Capabilities<\/h3>\n<p>vLLM has native support for distributed inference using tensor parallelism (TP) and pipeline parallelism (PP). This is where vLLM shines in a multi-GPU homelab.<\/p>\n<ul>\n<li><strong>Tensor Parallelism (TP):<\/strong> Each layer is split across the GPUs, and every GPU processes a slice of the same layer. For example, if a layer has 1024 input channels, GPU 0 handles the first 512 and GPU 1 handles the last 512. 
The GPUs run in lockstep on every token and exchange partial results after each layer. This works best with a high-speed interconnect (such as NVLink), and with one it can deliver large speedups.<\/li>\n<li><strong>Pipeline Parallelism (PP):<\/strong> Different GPUs process different layers of the model: GPU 0 handles layers 0-10, GPU 1 handles layers 11-20, and so on. This resembles Ollama&#8217;s approach, but with better batch scheduling and overlap.<\/li>\n<\/ul>\n<p>For consumer cards connected over PCIe, vLLM can still use TP or PP, but the performance gains will be limited by the bus. That said, vLLM is better optimized than Ollama at hiding this latency overhead. It uses a more sophisticated scheduler to make the most of the available bandwidth.<\/p>\n<h3>High Concurrency vs. Low Latency<\/h3>\n<p>vLLM is built for throughput. If you have a website or API serving 100 users at once, vLLM crushes Ollama in that scenario. It can sustain a high token generation rate even with many active requests.<\/p>\n<p>vLLM has its price, though. It carries more memory overhead than llama.cpp or Ollama, and the PagedAttention implementation requires additional metadata structures. You also need to configure vLLM carefully. 
You have to specify the number of GPUs, the tensor-parallel size, and the pipeline-parallel size.<\/p>\n<pre><code class=\"language-yaml\"># docker-compose.yaml example for vLLM multi-GPU serving\n# Note: FP16 weights for a 70B model need about 140GB of VRAM in total.\n# On 2x 24GB cards, substitute a 4-bit (AWQ or GPTQ) checkpoint instead.\nversion: '3.8'\nservices:\n  vllm:\n    image: vllm\/vllm-openai:latest\n    ipc: host  # tensor parallelism needs shared memory between worker processes\n    command:\n      - --model\n      - meta-llama\/Meta-Llama-3-70B-Instruct\n      - --tensor-parallel-size\n      - \"2\"\n      - --dtype\n      - float16\n      - --max-model-len\n      - \"8192\"\n      - --gpu-memory-utilization\n      - \"0.95\"\n    deploy:\n      resources:\n        reservations:\n          devices:\n            - driver: nvidia\n              count: 2\n              capabilities: [gpu]\n    ports:\n      - \"8000:8000\"\n    volumes:\n      - .\/models:\/models\n<\/code><\/pre>\n<p>This configuration tells vLLM to run the Llama 3 70B model with a tensor-parallel size of 2, and it assumes you have two GPUs. The <code>gpu-memory-utilization<\/code> flag is set to 0.95 to maximize VRAM usage while leaving a small buffer for safety. Keep in mind that in <code>float16<\/code> a 70B model needs about 140GB for its weights, so this exact configuration only fits on two 80GB-class GPUs; on a pair of 24GB cards you need a 4-bit quantized checkpoint.<\/p>\n<h3>Operator Notes: The Complexity Tax<\/h3>\n<p>Running vLLM in a homelab is not trivial. You need to understand Docker, networking, and vLLM&#8217;s configuration flags. If you misconfigure the tensor-parallel size, the server fails to start. If you set the maximum model length too high, you run out of memory.<\/p>\n<p>Still, for the homelab operator who values &#8220;production-like stability&#8221; and throughput, vLLM is the best tool. 
It handles the hard parts such as memory management and request scheduling, so you can focus on the workload rather than the infrastructure.<\/p>\n<h2>llama.cpp: The Raw Engine for Granular Control<\/h2>\n<p>If Ollama is a wrapper and vLLM an optimized server, llama.cpp is the raw engine running underneath. As a C\/C++ library, it is the driving force behind many tools (Ollama included). Running llama.cpp directly gives you the most control over hardware, model quantization, and memory layout. For the operator who wants to squeeze every last drop of performance out of their GPUs, llama.cpp is king.<\/p>\n<h3>The GGUF Format and Quantization<\/h3>\n<p>The heart of llama.cpp&#8217;s power is the GGUF format. Unlike standard PyTorch checkpoints, which are typically FP16 or FP32, GGUF models are usually quantized to reduce their memory footprint. Common quantization options are Q4_K_M (4-bit) and Q8_0 (8-bit).<\/p>\n<p>Quantization cuts the memory a model needs by roughly 2x (Q8_0) to 4x (Q4_K_M) relative to FP16. A 70B model that needs 140GB in FP16 fits in about 40GB of VRAM when quantized to Q4_K_M. This is why llama.cpp is so popular in the homelab community: it lets you run huge models on consumer-grade hardware that would otherwise be out of reach.<\/p>\n<p>When using llama.cpp across multiple GPUs, you typically rely on this quantization advantage to make the model fit. 
If the model is still larger than your VRAM, however, you also have the option of offloading layers to system RAM. This is not free: layers left in system RAM run much slower, and that is a cost you should plan for in a production-like setup. Other engines tend to support this less flexibly, or run it far slower.<\/p>\n<h3>Manual Tensor Splitting and Layer Control<\/h3>\n<p>In llama.cpp you have explicit control over how the model is split. You use the <code>-ngl<\/code> flag (number of GPU layers) to specify how many layers are offloaded to the GPU.<\/p>\n<p>A typical command line looks like this:<\/p>\n<pre><code class=\"language-bash\">.\/llama-server -m models\/llama-3-70b.Q4_K_M.gguf \\\n  --ctx-size 8192 \\\n  --n-gpu-layers 80 \\\n  --threads 12\n<\/code><\/pre>\n<p>Note that the command above says nothing about how the work is divided between GPUs. For multi-GPU setups llama.cpp uses a separate mechanism that lets you choose which GPUs are used and how the model is distributed across them. 
In recent versions llama.cpp detects multiple GPUs automatically and splits the layers between them, but you can also control this manually with environment variables or specific flags such as <code>--split-mode<\/code>.<\/p>\n<p>For example, you may want to force a particular split:<\/p>\n<pre><code class=\"language-bash\"># Use GPUs 0 and 1, offload all 80 layers, and split them evenly between the two cards\nexport CUDA_VISIBLE_DEVICES=0,1\n.\/llama-server -m models\/llama-3-70b.Q4_K_M.gguf \\\n  --n-gpu-layers 80 --split-mode layer --tensor-split 1,1\n<\/code><\/pre>\n<p>In practice, <code>-ngl<\/code> sets the total number of layers offloaded to the GPUs, not a per-GPU count. If the model has 80 layers and you have two GPUs, <code>-ngl 80<\/code> offloads all of them, and llama.cpp works out the distribution internally based on available VRAM, so two identical cards end up with roughly 40 layers each. With mismatched VRAM (for example one 24GB card and one 12GB card), llama.cpp places more layers on the 24GB card and fewer on the 12GB card. Keep in mind, though, that mismatched hardware configurations are a frequent source of unexpected errors.<\/p>\n<h3>Low-Level Control and Flexibility<\/h3>\n<p>llama.cpp&#8217;s biggest advantage is flexibility. You can produce custom builds, enable specific backends (CUDA, ROCm, Metal) for hardware-specific optimizations, and fine-tune the memory layout for your use case.<\/p>\n<p>For example, if you are running a model that needs a very large context window (e.g. 
128k tokens), llama.cpp lets you experiment with sliding-window techniques or KV cache eviction policies that other engines do not expose. Alternative tools are generally more restricted at this level.<\/p>\n<h3>Operator Notes: The Maintenance Burden<\/h3>\n<p>Running llama.cpp directly is real work. You have to compile binaries, manage model files, and make sure the command-line arguments are correct. If you want to serve it over an API, you either run the <code>llama-server<\/code> binary and handle the HTTP requests yourself, or wrap it in a frontend such as <code>text-generation-webui<\/code> (oobabooga).<\/p>\n<p>But for the operator who wants to understand exactly what is happening, llama.cpp is the best tool. You can read the logs, tune parameters, and optimize for your specific hardware topology. Where others just want it to &#8220;work&#8221;, you manage the system. That is the price you pay for predictable performance and stability.<\/p>\n<h2>Hardware Topology Matters: NVLink vs. PCIe Bottlenecks<\/h2>\n<p>Choosing between Ollama, vLLM, and llama.cpp is secondary to the physical hardware topology. Even with the best software in the world, your performance will be terrible if the hardware topology is wrong. 
The most critical factor is the interconnect between GPUs.<\/p>\n<h3>The NVLink Advantage (and Its Absence)<\/h3>\n<p>In data centers, GPUs are connected with NVLink (NVIDIA&#8217;s high-speed interconnect). NVLink provides enormous bandwidth (up to 900 GB\/s on the H100) and low latency, which means tight coupling between GPUs. When you run a model across two NVLink-connected H100s, they behave almost like one giant GPU. The overhead is negligible.<\/p>\n<p>Consumer cards mostly lack NVLink. The RTX 3090 still supports a two-card NVLink bridge, at a fraction of data center bandwidth, but NVIDIA removed the connector from the RTX 4090. Without a bridge you are left with the PCIe bus. This is a fundamental constraint.<\/p>\n<h3>PCIe Gen3 vs. Gen4 vs. Gen5<\/h3>\n<p>For multi-GPU inference on consumer hardware, the limiting factor is PCIe bus speed.<\/p>\n<ul>\n<li><strong>PCIe Gen3 x16:<\/strong> ~16 GB\/s theoretical bandwidth.<\/li>\n<li><strong>PCIe Gen4 x16:<\/strong> ~32 GB\/s theoretical bandwidth.<\/li>\n<li><strong>PCIe Gen5 x16:<\/strong> ~64 GB\/s theoretical bandwidth.<\/li>\n<\/ul>\n<p>In practice, protocol overhead means you get 70-80% of the theoretical bandwidth.<\/p>\n<p>If you have two RTX 3090s in PCIe Gen4 x16 slots, you theoretically have a 32 GB\/s pipe. For a 70B model, data movement is heavy: you are carrying the output of one layer to the input of the next. 
If the model is large, this data movement can saturate the PCIe bus, which means the GPUs spend their time waiting for data rather than computing.<\/p>\n<h3>The Impact of Latency on Context Switching<\/h3>\n<p>In multi-GPU setups, generating each token involves several rounds of data transfer. If your interconnect has high latency (as PCIe does), per-token latency goes up. This directly affects time to first token (TTFT) and overall token generation speed.<\/p>\n<p>For a single-user scenario this can be acceptable; the difference between 5 and 8 tokens per second may go unnoticed. In a system that needs high concurrency, however, the latency accumulates. If you have 100 concurrent requests and each one has to wait on PCIe transfers, your system slows down.<\/p>\n<h3>Comparison: Data Center vs. 
Consumer<\/h3>\n<table>\n<thead>\n<tr>\n<th>Feature<\/th>\n<th>Data Center (A100\/H100 + NVLink)<\/th>\n<th>Consumer (RTX 3090\/4090 + PCIe)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Interconnect<\/strong><\/td>\n<td>NVLink (up to 900 GB\/s)<\/td>\n<td>PCIe Gen4 (32 GB\/s)<\/td>\n<\/tr>\n<tr>\n<td><strong>Latency<\/strong><\/td>\n<td>Very Low<\/td>\n<td>Medium to High<\/td>\n<\/tr>\n<tr>\n<td><strong>Scalability<\/strong><\/td>\n<td>Near-linear (up to 8+ GPUs)<\/td>\n<td>Diminishing Returns (1.5x &#8211; 1.8x)<\/td>\n<\/tr>\n<tr>\n<td><strong>Cost<\/strong><\/td>\n<td>$10,000+ per card<\/td>\n<td>$1,500 per card<\/td>\n<\/tr>\n<tr>\n<td><strong>Power Draw<\/strong><\/td>\n<td>~300-450W per card<\/td>\n<td>~250-350W per card<\/td>\n<\/tr>\n<tr>\n<td><strong>Stability<\/strong><\/td>\n<td>Enterprise Grade<\/td>\n<td>Consumer Grade (thermal\/driver issues)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As you can see, a consumer setup is a major compromise. You save money but give up performance. The &#8220;linear scaling&#8221; you read about in benchmarks is a myth for consumer cards.<\/p>\n<h2>The Memory Hierarchy: VRAM, RAM, and the Risks of Swap<\/h2>\n<p>When you use multiple GPUs, you are dealing with a complex memory hierarchy: VRAM on each GPU, system RAM, and potentially disk (swap). The key to avoiding crashes is understanding how these tiers interact.<\/p>\n<h3>The VRAM Hierarchy<\/h3>\n<p>In a multi-GPU setup, total VRAM is the sum of the VRAM on all GPUs. 
You cannot, however, simply add it up and assume all of it is usable as one pool. The layers have to be split. With two 24GB cards you have 48GB in total, but you cannot hold a single 48GB block of weights on either card; the weights are divided between the cards.<\/p>\n<p>If a layer is larger than the VRAM capacity of a single card, that layer has to be split across multiple cards, which increases data movement. If it cannot be placed at all, the engine either crashes or falls back to CPU offloading.<\/p>\n<h3>CPU Offloading and RAM Swapping<\/h3>\n<p>When VRAM fills up, the inference engine can offload some layers to system RAM. This is called &#8220;CPU offloading&#8221;, and it is a performance killer. System RAM is far slower than VRAM; the bandwidth gap is between 10x and 20x.<\/p>\n<p>If your model is larger than your total VRAM, you have two options:<\/p>\n<ol>\n<li><strong>Quantize:<\/strong> Lower the precision of the weights (e.g. from FP16 to Q4_K_M). This reduces the memory footprint.<\/li>\n<li><strong>Offload:<\/strong> Move some layers to RAM.<\/li>\n<\/ol>\n<p>If you choose offloading, your token generation rate drops dramatically. You can fall from 10 tokens per second to 1. The reason is that the offloaded layers are bound by system RAM bandwidth. 
You can rework the memory layout to reduce how much lands in RAM, but whatever remains there still needs CPU power to process.<\/p>\n<h3>The Risk of Swap Files<\/h3>\n<p>If system RAM fills up, the operating system starts using disk as swap space. This is disastrous for LLM inference. Disk I\/O is orders of magnitude slower than RAM. The system can freeze, and in-flight work may be lost.<\/p>\n<p>To prevent this, make sure you have enough system RAM to hold the offloaded layers. If you have a 70B model in Q4_K_M format taking up around 40GB and 64GB of RAM, only 24GB remains for the operating system and other processes. That is tight. You need to monitor system memory usage continuously.<\/p>\n<h3>Operator Notes: &#8220;Swap&#8221; Panic<\/h3>\n<p>The moment you see swap activity on your system, stop inference immediately. Heavy swapping can make the system unresponsive and may destabilize the GPU driver. Always leave a safety margin. Do not use 100% of your RAM; leave at least 10-15% free for the operating system. Capping virtual memory to protect the system is also an option.<\/p>\n<h2>Pre-Deployment Hardware Audit: A Checklist for Stability<\/h2>\n<p>You must audit the hardware before you start running models. This is not a luxury: a failed or skipped audit ends in a crashed deployment. 
Use this list to validate your setup.<\/p>\n<ul>\n<li>\u2610 <strong>GPU Compatibility:<\/strong> Make sure all GPUs are from the same vendor, either NVIDIA or AMD. Do not mix NVIDIA and AMD cards in the same multi-GPU setup; driver incompatibilities are all but inevitable.<\/li>\n<li>\u2610 <strong>PCIe Topology:<\/strong> Check whether all GPUs sit under the same PCIe root complex. Unless you have NVLink or another high-speed interconnect, avoid splitting GPUs across different CPU sockets. Even inter-socket links such as Intel&#8217;s QuickPath or AMD&#8217;s Infinity Fabric do not deliver full bandwidth, and latency increases.<\/li>\n<li>\u2610 <strong>Power Supply:<\/strong> Make sure the PSU has enough headroom. Two 3090s can momentarily exceed 700W. A PSU of at least 1000W with adequate rails is a must. Cheap or low-quality power supplies tend to trip under load.<\/li>\n<li>\u2610 <strong>Cooling:<\/strong> Verify airflow. Two GPUs in a small case will overheat. Liquid cooling or high-static-pressure fans are strongly recommended. If the fans are obstructed the system shuts down in protection mode; that is a safeguard, not a fault.<\/li>\n<li>\u2610 <strong>Driver Version:<\/strong> Make sure your CUDA drivers are up to date. Outdated drivers are a common cause of crashes and out-of-memory (OOM) errors. Why? 
Because newer engine builds often depend on CUDA features that older drivers do not provide.<\/li>\n<li>\u2610 <strong>System RAM:<\/strong> Make sure you have enough system RAM for offloading. If you expect to offload, aim for at least twice the model size in RAM. DDR5 is not mandatory; older DDR4 can be sufficient, but you will notice the speed difference.<\/li>\n<li>\u2610 <strong>Swap Space:<\/strong> Either disable swap or configure a very large swap file to prevent crashes (though avoiding swapping altogether is best). Swap use wears out SSDs quickly and cripples performance. If you find yourself depending on virtual memory, treat it as a sign that you are short on physical RAM.<\/li>\n<\/ul>\n<h2>Validating a Multi-GPU Configuration: A Step-by-Step Guide<\/h2>\n<p>Once the hardware audit is complete, confirm that the software is configured correctly. Follow these steps to make sure the multi-GPU setup works as expected.<\/p>\n<ul>\n<li>\u2610 <strong>Device Detection:<\/strong> Run <code>nvidia-smi<\/code> to verify that all GPUs are visible and healthy. If you suspect driver problems, also check <code>lspci<\/code>, but <code>nvidia-smi<\/code> is the real evidence.<\/li>\n<li>\u2610 <strong>CUDA Context:<\/strong> Make sure a CUDA context is created on every device. 
If this simple step is skipped, the system may not raise an obvious error, yet the device will sit unused.<\/li>\n<li>\u2610 <strong>Layer Split:<\/strong> Check the logs to confirm the layers are split across GPUs. Look for messages along the lines of &#8220;Layer 0-20 on GPU 0, Layer 21-50 on GPU 1&#8221; (the exact wording varies by engine). If one card carries the load while the other is idle, your split logic is broken.<\/li>\n<li>\u2610 <strong>Memory Usage:<\/strong> Monitor VRAM usage on each card. It should be balanced. One card packed full while the other sits empty is the clearest evidence that the split is misconfigured.<\/li>\n<li>\u2610 <strong>Performance Baseline:<\/strong> Send a small test request and measure the token generation rate. Compare it against single-GPU performance to establish whether you are actually getting a speedup. A single GPU slowing down due to overheating can mask the advantage of the pair.<\/li>\n<li>\u2610 <strong>Stress Test:<\/strong> Run a sustained 30-minute inference job. Check for thermal throttling or driver crashes. The cooling system is usually the first thing to give out, so do not rely on software alone.<\/li>\n<li>\u2610 <strong>Concurrency Test:<\/strong> Send multiple requests at once. Confirm the system handles the load without crashing. 
If it cannot serve several streams at once without running into memory limits, your architecture is not ready for a real multi-user scenario.<\/li>\n<\/ul>\n<h2>Inference Engine Feature Matrix<\/h2>\n<p>To help you pick the right engine, here is a direct comparison of Ollama, vLLM, and llama.cpp on the key metrics. Note that the cheapest solution is not always the least troublesome; decide based on expected lifetime and running costs.<\/p>\n<table>\n<thead>\n<tr>\n<th>Feature<\/th>\n<th>Ollama<\/th>\n<th>vLLM<\/th>\n<th>llama.cpp<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Ease of Use<\/strong><\/td>\n<td>High (single command)<\/td>\n<td>Medium (configuration required)<\/td>\n<td>Low (CLI \/ custom build)<\/td>\n<\/tr>\n<tr>\n<td><strong>Multi-GPU Support<\/strong><\/td>\n<td>Basic (layer split)<\/td>\n<td>Advanced (TP\/PP)<\/td>\n<td>Advanced (layer split)<\/td>\n<\/tr>\n<tr>\n<td><strong>Concurrency Handling<\/strong><\/td>\n<td>Low<\/td>\n<td>High<\/td>\n<td>Medium<\/td>\n<\/tr>\n<tr>\n<td><strong>Latency<\/strong><\/td>\n<td>Low<\/td>\n<td>Low<\/td>\n<td>Very Low<\/td>\n<\/tr>\n<tr>\n<td><strong>Memory Efficiency<\/strong><\/td>\n<td>Medium<\/td>\n<td>High (PagedAttention)<\/td>\n<td>Very High (GGUF)<\/td>\n<\/tr>\n<tr>\n<td><strong>Hardware Flexibility<\/strong><\/td>\n<td>Medium (CUDA\/ROCm\/Metal, fewer tuning options)<\/td>\n<td>Medium (CUDA\/ROCm)<\/td>\n<td>High (CUDA\/ROCm\/Metal)<\/td>\n<\/tr>\n<tr>\n<td><strong>Maintenance 
Burden<\/strong><\/td>\n<td>Low<\/td>\n<td>Medium<\/td>\n<td>High<\/td>\n<\/tr>\n<tr>\n<td><strong>Best Suited For<\/strong><\/td>\n<td>Personal use, single user<\/td>\n<td>API serving, high concurrency<\/td>\n<td>Custom builds, maximum control<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Multi-GPU Performance Expectations by Topology<\/h2>\n<p>In multi-GPU setups, the performance gain depends directly on bus topology, and speedup ratios vary widely with the hardware. PCIe-based setups are cheaper than dedicated interconnects such as NVLink, but the bottlenecks they introduce slow the system down for its entire lifetime. The weakest link is the one that will fall short of your expectations, and it shows up quickly.<\/p>\n<table>\n<thead>\n<tr>\n<th>Topology<\/th>\n<th>Bandwidth<\/th>\n<th>Expected Speedup<\/th>\n<th>Notes<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>NVLink (H100)<\/strong><\/td>\n<td>900 GB\/s<\/td>\n<td>1.8x &#8211; 2.0x<\/td>\n<td>Near-linear scaling.<\/td>\n<\/tr>\n<tr>\n<td><strong>PCIe Gen5 x16<\/strong><\/td>\n<td>64 GB\/s<\/td>\n<td>1.5x &#8211; 1.8x<\/td>\n<td>Good scaling, low latency.<\/td>\n<\/tr>\n<tr>\n<td><strong>PCIe Gen4 x16<\/strong><\/td>\n<td>32 GB\/s<\/td>\n<td>1.3x &#8211; 1.6x<\/td>\n<td>Moderate scaling.<\/td>\n<\/tr>\n<tr>\n<td><strong>PCIe Gen3 x16<\/strong><\/td>\n<td>16 GB\/s<\/td>\n<td>1.1x &#8211; 1.4x<\/td>\n<td>Diminishing returns.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note: these are estimates for a 70B-parameter model. 
Smaller models benefit less, while larger models may benefit more because of their VRAM requirements. On older hardware, however, cooling and electricity costs will not scale in proportion to the performance gain.<\/p>\n<h2>Hardware Failure Mode Analysis<\/h2>\n<p>Multi-GPU setups are prone to specific failure modes. Understanding them saves hours of troubleshooting. Multi-card builds have more points of failure than a single card, and cooling and the power supply tend to go first.<\/p>\n<table>\n<thead>\n<tr>\n<th>Failure Mode<\/th>\n<th>Symptoms<\/th>\n<th>Cause<\/th>\n<th>Mitigation<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Thermal Throttling<\/strong><\/td>\n<td>Slow token generation, high temperatures<\/td>\n<td>Poor airflow, high ambient temperature<\/td>\n<td>Improve cooling, undervolt. Use dynamic fan control driven by temperature rather than a fixed fan speed.<\/td>\n<\/tr>\n<tr>\n<td><strong>Driver Crash<\/strong><\/td>\n<td>OOM errors, system hangs<\/td>\n<td>Driver instability, memory leak<\/td>\n<td>Update drivers, restart the service. 
Validate any unfamiliar update in a test environment first; at the very least, keep track of your version history.<\/td>\n<\/tr>\n<tr>\n<td><strong>PCIe Stall<\/strong><\/td>\n<td>Low transfer rates, high latency<\/td>\n<td>Bus congestion, bad cable<\/td>\n<td>Check the PCIe link speed and use Gen4 where available. Verify cable quality and slot placement; swapping the cable or slot is often the quickest fix.<\/td>\n<\/tr>\n<tr>\n<td><strong>Power Spike<\/strong><\/td>\n<td>System reboots, shutdowns<\/td>\n<td>Power supply overload<\/td>\n<td>Upgrade the power supply and check power draw. Size the PSU for headroom, not just for load; keeping sustained load below 70% of its rating is the smarter choice for longevity.<\/td>\n<\/tr>\n<tr>\n<td><strong>Layer Split Failure<\/strong><\/td>\n<td>Model fails to load<\/td>\n<td>Model does not fit in VRAM<\/td>\n<td>Adjust the <code>-ngl<\/code> parameter or use quantization. Work out the model size and quantization level in advance rather than trying to cram layers into VRAM; otherwise the system hangs.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Troubleshooting: When Multi-GPU Fails<\/h2>\n<p>Even with the best planning, things go wrong. Here are the common problems and their fixes.<\/p>\n<h3>Problem 1: &#8220;CUDA out of memory&#8221; on GPU 0<\/h3>\n<p>This usually indicates that the model is not being split correctly. 
GPU 0 is trying to load too many layers.<br \/>\n<strong>Fix:<\/strong> Shift more layers to the second GPU (in llama.cpp, via <code>--tensor-split<\/code>) or split the model manually. If the model does not fit, shrinking it with a smaller quantization or adding more memory may be the more sensible route.<\/p>\n<h3>Problem 2: Performance is slower than on a single GPU<\/h3>\n<p>This usually happens when PCIe bandwidth becomes the bottleneck.<br \/>\n<strong>Fix:<\/strong> Check whether the model fits on a single GPU; if it does, run it there. If not, accept the latency. Verify that the PCIe lanes are running in Gen4 mode. You can also consider resizing the model to suit the speed of your PCIe slots.<\/p>\n<h3>Problem 3: One GPU is idle, the other is full<\/h3>\n<p>This is a sign of a bad split.<br \/>\n<strong>Fix:<\/strong> Check the logs and make sure the model is split evenly. If the model&#8217;s layer sizes are uneven, you may need to adjust the split. In that case, setting the distribution by hand is the most reliable fix.<\/p>\n<h3>Problem 4: The system crashes after a few minutes<\/h3>\n<p>This is probably a thermal or power problem.<br \/>\n<strong>Fix:<\/strong> Check temperatures. Make sure the PSU is not overloaded. 
A realistic and often overlooked culprit is a weak power supply whose voltage sags under short bursts of load; such units have to be replaced.<\/p>\n<h2>Practical Scenario: A 70B Model on Two RTX 3090s<\/h2>\n<p>Suppose you have two RTX 3090s (24GB each) and want to run Llama 3 70B.<\/p>\n<ol>\n<li><strong>Model Size:<\/strong> At Q4_K_M quantization, a 70B model takes up about 40GB.<\/li>\n<li><strong>Total VRAM:<\/strong> 48GB.<\/li>\n<li><strong>Split:<\/strong> You split the model in two: 20GB on GPU 0 and 20GB on GPU 1.<\/li>\n<li><strong>Performance:<\/strong> The hoped-for speedup is 1.5x over a single 24GB card (which cannot hold this model in VRAM on its own).<\/li>\n<li><strong>Reality:<\/strong> Because of the PCIe Gen4 bottleneck, the speedup you actually get is closer to 1.3x. Set your expectations accordingly.<\/li>\n<li><strong>Power:<\/strong> The system draws about 700W.<\/li>\n<li><strong>Cost:<\/strong> Expect roughly an extra $50 per month on the electricity bill.<\/li>\n<li><strong>Verdict:<\/strong> It works, but it is expensive and loud. If a 34B model on a single card covers your needs, prefer that. The simpler alternative is often the more sensible one.<\/li>\n<\/ol>\n<h2>Frequently Asked Questions<\/h2>\n<p><strong>Can NVIDIA and AMD cards be combined for multi-GPU inference?<\/strong><br \/>\nGenerally no. Most inference stacks (Ollama, vLLM, llama.cpp) use CUDA for NVIDIA and ROCm for AMD, and these driver ecosystems do not cooperate within a single inference job. 
You cannot split one model across NVIDIA and AMD cards in the same process. You would have to run them as separate instances, which defeats the purpose of multi-GPU scaling.<\/p>\n<p><strong>Does adding a second GPU always double the speed?<\/strong><br \/>\nNo. Because of the PCIe bandwidth bottleneck and synchronization overhead, the speedup is rarely linear. Depending on your stack and topology, you typically get between 1.3x and 1.8x. In some cases, if the model is very small, the coordination overhead means a second GPU can actually slow you down.<\/p>\n<p><strong>How do you handle models that exceed total VRAM?<\/strong><br \/>\nYou have three options:<br \/>\n1.  <strong>Quantize:<\/strong> Use lower precision to shrink the memory footprint (e.g. Q3_K_S).<br \/>\n2.  <strong>Offload:<\/strong> Move layers to system RAM. This slows inference down severely.<br \/>\n3.  <strong>Distribute:<\/strong> If you have more than two cards, you can spread the layers across more of them.<\/p>\n<p><strong>What is the best way to monitor multi-GPU performance?<\/strong><br \/>\nUse <code>nvtop<\/code> or <code>nvidia-smi<\/code> to monitor GPU utilization and memory; <code>htop<\/code> only covers CPU and system RAM. Check your inference engine&#8217;s logs for layer distribution messages. 
Use a benchmark tool to measure tokens per second.<\/p>\n<p><strong>Is upgrading to a PCIe Gen5 motherboard worth it?<\/strong><br \/>\nIt depends. If your cards are PCIe Gen4, a Gen5 board adds little, because the link runs at the card&#8217;s speed. With Gen5 cards on a Gen5 motherboard you may see a small gain. However, Gen5 motherboards and CPUs are expensive. For most homelabs, Gen4 is enough.<\/p>\n<h2>Conclusion: Choosing the Right Tool for the Job<\/h2>\n<p>Choosing between Ollama, vLLM, and llama.cpp for multi-GPU inference is not about picking the &#8220;best&#8221; one but about finding the one that fits your constraints.<\/p>\n<ul>\n<li><strong>Choose Ollama<\/strong> if you want a simple, user-friendly setup for personal use or a small team. It handles the complexity for you, but you give up flexibility and concurrent request capacity.<\/li>\n<li><strong>Choose vLLM<\/strong> if you need high throughput, low latency, and concurrent requests. It is the best option for a production-like homelab server, but it requires more configuration and resources.<\/li>\n<li><strong>Choose llama.cpp<\/strong> if you need maximum control, custom quantization, or support for non-standard hardware. It is the most flexible option but also the most complex to manage.<\/li>\n<\/ul>\n<p>Whichever you choose, remember that hardware topology is the ultimate constraint. 
On consumer hardware, the PCIe bus will always be a bottleneck. The electricity bill is always a factor. Stability is always a trade-off.<\/p>\n<p>Do not chase headline benchmark numbers. Build a system that works reliably for your specific use case. If you need to run a 70B model, buy two GPUs and accept the PCIe penalty. If you want a fast assistant, start with a single large card and stop overthinking it.<\/p>\n<p>The real strength of a homelab lies not in owning the most expensive hardware but in understanding the limits of the hardware you have. Use that knowledge to build an efficient, stable, and economical system. In the end, the electricity bill is the one cost you cannot ignore.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ollama vllm llama cpp: a detailed analysis for homelab operators covering Ollama, vLLM, and llama.cpp multi-GPU performance, bandwidth bottlenecks, and setup guidance.<\/p>\n","protected":false},"author":1,"featured_media":626,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"rank_math_title":"Ollama vllm llama cpp: 2026 Guide","rank_math_description":"Ollama vllm llama cpp: a detailed analysis for homelab operators covering Ollama, vLLM, and llama.cpp multi-GPU performance, bandwidth bottlenecks, and setup guidance.","rank_math_focus_keyword":"Ollama vllm llama 
cpp","footnotes":""},"categories":[1],"tags":[266,178,257,262,265,256],"class_list":["post-627","post","type-post","status-publish","format-standard","has-post-thumbnail","category-genel","tag-coklu-gpu","tag-homelab","tag-llama-cpp","tag-ollama","tag-ollama-vllm-llama-cpp-coklu","tag-vllm"],"_links":{"self":[{"href":"https:\/\/m4.ist\/index.php\/wp-json\/wp\/v2\/posts\/627","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/m4.ist\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/m4.ist\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/m4.ist\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/m4.ist\/index.php\/wp-json\/wp\/v2\/comments?post=627"}],"version-history":[{"count":0,"href":"https:\/\/m4.ist\/index.php\/wp-json\/wp\/v2\/posts\/627\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/m4.ist\/index.php\/wp-json\/wp\/v2\/media\/626"}],"wp:attachment":[{"href":"https:\/\/m4.ist\/index.php\/wp-json\/wp\/v2\/media?parent=627"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/m4.ist\/index.php\/wp-json\/wp\/v2\/categories?post=627"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/m4.ist\/index.php\/wp-json\/wp\/v2\/tags?post=627"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}