
Don't cache resolved hostnames forever #16412

Closed
clintongormley opened this issue Feb 3, 2016 · 35 comments · Fixed by #21630
Labels
:Distributed/Network Http and internode communication implementations >enhancement

Comments

@clintongormley

Today we use InetAddress to represent IP addresses. InetAddress handles the work of resolving hostnames from DNS and from the local hosts file.

With the security manager enabled, successful hostname lookups are cached forever to prevent spoofing attacks. I don't know if this behaviour was different before the security manager was enabled, but it seems unlikely given issues such as #10337 and #14441.

It would be a useful improvement to be able to specify unicast hosts as hostnames which are looked up from DNS or the hosts file; then, if the IP addresses change and the node needs to reconnect to the cluster, it can just do a fresh lookup to gather the current IPs. Similar logic would help the clients.

If we make this change, it should be configurable (otherwise we're introducing the chance for spoofing) and we should consider the impact on hostname verification of ssl certs.

Testing this change would be hard...

@clintongormley clintongormley added >enhancement discuss :Distributed/Network Http and internode communication implementations labels Feb 3, 2016
@danielmitterdorfer
Member

The DNS cache of a Java process is handled by the property networkaddress.cache.ttl. Quoting Oracle Docs:

Specified in java.security to indicate the caching policy for successful name lookups from the name service. The value is specified as an integer to indicate the number of seconds to cache the successful lookup.

A value of -1 indicates "cache forever". The default behavior is to cache forever when a security manager is installed, and to cache for an implementation specific period of time, when a security manager is not installed.
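
For illustration, the same policy can also be set programmatically via java.security.Security. This is a minimal sketch; it assumes it runs before the JDK initializes its address cache, since the property is typically read only once:

import java.security.Security;

public class DnsTtlConfig {
    public static void main(String[] args) {
        // Must run before the first name lookup: the JDK reads these
        // security properties once when its address cache is initialized.
        Security.setProperty("networkaddress.cache.ttl", "60");          // cache successful lookups for 60s
        Security.setProperty("networkaddress.cache.negative.ttl", "10"); // cache failed lookups for 10s
    }
}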

@danielmitterdorfer
Member

As per the Oracle documentation on policy files, the above-mentioned property (and possibly also the related property networkaddress.cache.negative.ttl, for failed DNS lookups) has to be specified in a file called java.security. This file is located in $JRE_HOME/lib/security and its settings are system-wide.

It is possible to provide an application-specific java.security file that overrides the system-wide defaults by adding the system property -Djava.security.properties=$CUSTOM_SECURITY_FILE and specifying application-specific overrides there. However, this only works if the property security.overridePropertiesFile is set to true in $JRE_HOME/lib/security/java.security (the default is true). So in case an administrator disables this, we won't be able to override the system-wide setting.
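
For illustration only (the file name and path are arbitrary), such an override file could contain:

networkaddress.cache.ttl=60
networkaddress.cache.negative.ttl=10

and the application would then be started with:

java -Djava.security.properties=/path/to/custom.security -jar my-app.jar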

I wonder whether we should just point users to the Oracle documentation on how to change the DNS cache lifetime system-wide, because I am not sure they would want to configure different (Java) DNS cache lifetimes for different applications on the same machine anyway. Wdyt @clintongormley?

@clintongormley
Author

@danielmitterdorfer I'd be happy with just adding this documentation. Is there anything we need to change code-wise as well? Or perhaps just adding tests to ensure that the documented solution works?

@clintongormley
Author

Also related to #10337 and logstash-plugins/logstash-output-elasticsearch#131 and #11256

@miah

miah commented Feb 13, 2016

We run all our JVMs with this configuration. We also only pass hostnames to ES via command arguments at startup. When we roll out new instances on AWS, ES never refreshes the hosts' IPs via DNS lookups.

Would appreciate it if somebody else could verify that it works as expected, because it doesn't seem to in our environment.

@danielmitterdorfer
Member

@clintongormley I would just point users to the Oracle documentation and not add any tests, for two reasons:

  1. This configuration is purely a JVM matter; none of it is Elasticsearch-specific.
  2. It is a global change, so the JDK configuration would have to change in every environment where the tests run.

@danielmitterdorfer
Member

@miah I have created a small demo program to verify that everything works as expected (source code as a gist). In addition, I've set these values in my java.security in the JRE/lib/security directory:

networkaddress.cache.ttl=10
networkaddress.cache.negative.ttl=-1

When you look at the demo source code, you'll see that we query for one existing host ("www.google.com") and for one non-existing one ("www.this-does-not-exist.io"). My expectation is that we see DNS requests every 10 seconds for the existing host and only one for the non-existing one when we periodically clear the OS DNS cache.
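
The gist itself is not reproduced in this thread; a minimal sketch of an equivalent program might look like this (hostnames taken from the comment, and the 5-second loop interval inferred from the output below):

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.time.LocalTime;

public class DnsCacheDemo {
    public static void main(String[] args) throws InterruptedException {
        String[] hosts = {"www.google.com", "www.this-does-not-exist.io"};
        while (true) {
            System.out.println(LocalTime.now().withNano(0));
            for (String host : hosts) {
                try {
                    // Each call consults the JVM's address cache; a real DNS
                    // query is only sent when the cached entry has expired.
                    System.out.println(host + " => " + InetAddress.getByName(host).getHostAddress());
                } catch (UnknownHostException e) {
                    // Failed lookups are cached per networkaddress.cache.negative.ttl
                }
            }
            Thread.sleep(5000);
        }
    }
}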

When you invoke the demo program as described in the gist, open tcpdump (I did sudo tcpdump -vvv -s 0 -l -n port 53), and periodically clear the DNS cache (how depends on your OS), you'll see the expected behavior (my environment is Mac OS X 10.11):

Output from the application:

14:04:10
www.google.com => 173.194.39.18
14:04:15
www.google.com => 173.194.39.18
14:04:20
www.google.com => 173.194.39.18
14:04:25
www.google.com => 173.194.39.18
14:04:30
www.google.com => 173.194.39.18
14:04:35
www.google.com => 173.194.39.18

Output from tcpdump:

tcpdump: data link type PKTAP
tcpdump: listening on pktap, link-type PKTAP (Packet Tap), capture size 262144 bytes
14:04:10.592232 IP (tos 0x0, ttl 255, id 64876, offset 0, flags [none], proto UDP (17), length 60)
    192.168.1.103.60113 > 192.168.1.1.53: [udp sum ok] 60929+ A? www.google.com. (32)
14:04:10.592301 IP (tos 0x0, ttl 255, id 6814, offset 0, flags [none], proto UDP (17), length 60)
    192.168.1.103.58959 > 192.168.1.1.53: [udp sum ok] 36807+ AAAA? www.google.com. (32)
14:04:10.624765 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 140)
    192.168.1.1.53 > 192.168.1.103.60113: [udp sum ok] 60929 q: A? www.google.com. 5/0/0 www.google.com. [4m48s] A 173.194.39.20, www.google.com. [4m48s] A 173.194.39.18, www.google.com. [4m48s] A 173.194.39.17, www.google.com. [4m48s] A 173.194.39.16, www.google.com. [4m48s] A 173.194.39.19 (112)
14:04:10.625961 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 88)
    192.168.1.1.53 > 192.168.1.103.58959: [udp sum ok] 36807 q: AAAA? www.google.com. 1/0/0 www.google.com. [4m54s] AAAA 2a00:1450:4005:800::1010 (60)
14:04:10.695845 IP (tos 0x0, ttl 255, id 17766, offset 0, flags [none], proto UDP (17), length 72)
    192.168.1.103.49957 > 192.168.1.1.53: [udp sum ok] 27956+ A? www.this-does-not-exist.io. (44)
14:04:10.695942 IP (tos 0x0, ttl 255, id 63692, offset 0, flags [none], proto UDP (17), length 72)
    192.168.1.103.51379 > 192.168.1.1.53: [udp sum ok] 47751+ AAAA? www.this-does-not-exist.io. (44)
14:04:10.727572 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 141)
    192.168.1.1.53 > 192.168.1.103.49957: [udp sum ok] 27956 NXDomain q: A? www.this-does-not-exist.io. 0/1/0 ns: io. [19m14s] SOA ns1.communitydns.net. nicadmin.nic.io. 1455538915 3600 1800 3600000 3600 (113)
14:04:10.728133 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 141)
    192.168.1.1.53 > 192.168.1.103.51379: [udp sum ok] 47751 NXDomain q: AAAA? www.this-does-not-exist.io. 0/1/0 ns: io. [19m14s] SOA ns1.communitydns.net. nicadmin.nic.io. 1455538915 3600 1800 3600000 3600 (113)
14:04:20.733510 IP (tos 0x0, ttl 255, id 51420, offset 0, flags [none], proto UDP (17), length 60)
    192.168.1.103.58825 > 192.168.1.1.53: [udp sum ok] 11803+ A? www.google.com. (32)
14:04:20.733566 IP (tos 0x0, ttl 255, id 13172, offset 0, flags [none], proto UDP (17), length 60)
    192.168.1.103.60441 > 192.168.1.1.53: [udp sum ok] 14292+ AAAA? www.google.com. (32)
14:04:20.737039 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 140)
    192.168.1.1.53 > 192.168.1.103.58825: [udp sum ok] 11803 q: A? www.google.com. 5/0/0 www.google.com. [4m38s] A 173.194.39.19, www.google.com. [4m38s] A 173.194.39.16, www.google.com. [4m38s] A 173.194.39.17, www.google.com. [4m38s] A 173.194.39.18, www.google.com. [4m38s] A 173.194.39.20 (112)
14:04:20.737435 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 88)
    192.168.1.1.53 > 192.168.1.103.60441: [udp sum ok] 14292 q: AAAA? www.google.com. 1/0/0 www.google.com. [4m44s] AAAA 2a00:1450:4005:800::1010 (60)
14:04:30.748453 IP (tos 0x0, ttl 255, id 17598, offset 0, flags [none], proto UDP (17), length 60)
    192.168.1.103.55087 > 192.168.1.1.53: [udp sum ok] 1167+ A? www.google.com. (32)
14:04:30.748806 IP (tos 0x0, ttl 255, id 29531, offset 0, flags [none], proto UDP (17), length 60)
    192.168.1.103.60394 > 192.168.1.1.53: [udp sum ok] 60974+ AAAA? www.google.com. (32)
14:04:30.754529 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 140)
    192.168.1.1.53 > 192.168.1.103.55087: [udp sum ok] 1167 q: A? www.google.com. 5/0/0 www.google.com. [4m28s] A 173.194.39.20, www.google.com. [4m28s] A 173.194.39.19, www.google.com. [4m28s] A 173.194.39.16, www.google.com. [4m28s] A 173.194.39.17, www.google.com. [4m28s] A 173.194.39.18 (112)
14:04:30.754533 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 88)
    192.168.1.1.53 > 192.168.1.103.60394: [udp sum ok] 60974 q: AAAA? www.google.com. 1/0/0 www.google.com. [4m34s] AAAA 2a00:1450:4005:800::1010 (60)

You can see that we query only once for "www.this-does-not-exist.io" but every 10 seconds for "www.google.com". Similarly, you should see only one DNS request for "www.google.com" when you set networkaddress.cache.ttl=-1 in java.security.

@danielmitterdorfer
Member

@miah One more thing: note that the program runs with the security manager enabled. The JDK implementation behaves differently depending on whether a security manager is enabled (see my comment above with the link to the Oracle docs). As Elasticsearch 3.0 will make the security manager mandatory (#16176), it is sensible to assume here too that a security manager is enabled.

@alexbrasetvik
Contributor

It would also be nice if the transport client could do the lookups when it connects, instead of just when the client object is created, as IPs can change, e.g. when going through a load balancer.

@beiske
Member

beiske commented Feb 25, 2016

Despite keeping the hostname, an InetAddress will never attempt to resolve the IP again after it has been created. I think this is a source of confusion for many users of the transport client. It is particularly important when connecting to the cloud service: due to the load balancers, even a single-node cluster has multiple IP addresses, and they may change at any point in time.
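
To illustrate the point (a sketch; the hostname is hypothetical): the address inside an InetAddress is fixed at creation, and picking up a DNS change requires a fresh lookup.

import java.net.InetAddress;
import java.net.UnknownHostException;

public class StaleAddressDemo {
    public static void main(String[] args) throws UnknownHostException {
        // Resolved exactly once, at creation time; 'node' keeps this
        // address for its entire lifetime, even if DNS changes later.
        InetAddress node = InetAddress.getByName("es-node.example.com"); // hypothetical host
        System.out.println(node.getHostAddress());

        // Picking up a new address requires a fresh lookup (still subject
        // to the JVM's DNS cache TTL discussed above):
        InetAddress refreshed = InetAddress.getByName("es-node.example.com");
        System.out.println(refreshed.getHostAddress());
    }
}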

@clintongormley Is this issue also relevant for the transport client or should we make a separate issue?

@danielmitterdorfer
Member

@alexbrasetvik, @beiske: I don't know what @clintongormley thinks about your idea but I think it would be better to create a new ticket for the transport client topic.

@lifeofguenter

What's the status on this? This causes problems with the AWS Elasticsearch service.

@danielmitterdorfer
Member

@lifeofguenter The topic discussed in this ticket has nothing to do with Elasticsearch per se. It is a pure JVM-level setting, so (in the scope of this ticket) we will not change any code but may just add documentation on how to change this setting for Elasticsearch.

Just to be sure: by "AWS ElasticSearch service" you mean Amazon's service and not our Elasticsearch cloud offering (called Elastic Cloud), right? We have no additional insight into Amazon's offering, and I am afraid you will not be able to change a JVM-level setting there either. I fear this has to be addressed by Amazon (as it is a JVM-level setting that we cannot change from within the application).

@lifeofguenter

I made the following changes (Ubuntu 14.04) in /usr/lib/jvm/jdk-8-oracle-x64/jre/lib/security/java.security:

networkaddress.cache.ttl=60
networkaddress.cache.negative.ttl=10

But that somehow did not do the trick?

Yes, I am referring to https://aws.amazon.com/elasticsearch-service/. However, we are running Logstash as per http://www.lifeofguenter.de/2016/01/elk-aws-elasticbeanstalk-laravel.html, which is hosted on our "own" EC2 instance, so we are able to make changes there, and that is also the component that currently complains if the DNS record for Elasticsearch changes.

UPDATE: sorry, my problems are most probably unrelated!

@miah

miah commented Apr 25, 2016

Still have this issue...

If I replace masters, I need to reboot every node in the cluster; otherwise it never detects the IP changes.

I have a DNS TTL of 6 minutes. I replaced my master servers, and 20 minutes later Elasticsearch is still trying to connect to the old IPs. I have the java.security changes in place. Elasticsearch is configured to connect to a round-robin DNS entry for the master nodes.

services 16207 17.0 63.3 35426032 9753368 ?    Sl   Apr05 4911:31 /usr/lib/jvm/java-8-oracle/bin/java --http.port 9200 
--transport.tcp.port 9300 
--cluster.name=logsearch-dev 
--cluster.routing.allocation.allow_rebalance=always --cluster.routing.allocation.cluster_concurrent_rebalance=2 
--cluster.routing.allocation.node_concurrent_recoveries=2 --cluster.routing.allocation.node_initial_primaries_recoveries=12 
--cluster.routing.allocation.enable=all 
--node.name=logsearch-data-5.dev.bs.com
--node.master=false 
--node.data=true 
--node.auto_attributes=true  
--discovery.zen.minimum_master_nodes=2 
--discovery.zen.ping.multicast.enabled=false 
--discovery.zen.ping.unicast.hosts=app.logsearch-master.dev.bs.com 
--discovery.zen.ping_timeout=10s 
--discovery.zen.fd.ping_interval=1s 
--discovery.zen.fd.ping_timeout=60s 
--discovery.zen.fd.ping_retries=3
grep cache /usr/lib/jvm/java-8-oracle/jre/lib/security/java.security
# The Java-level namelookup cache policy for successful lookups:
# any positive value: the number of seconds to cache an address for
# zero: do not cache
# is to cache for 30 seconds.
networkaddress.cache.ttl=0
# The Java-level namelookup cache policy for failed lookups:
# any negative value: cache forever
# any positive value: the number of seconds to cache negative lookup results
# zero: do not cache
networkaddress.cache.negative.ttl=0

@jasontedor
Member

jasontedor commented Apr 25, 2016

@clintongormley @danielmitterdorfer I don't think it's correct that setting the DNS cache properties at the JVM level will resolve the problems being reported here today. The underlying reason is that we do the hostname lookup during initialization of unicast zen ping and never do lookups again. This is currently a deliberate choice.

@miah

miah commented Apr 25, 2016

@jasontedor That definitely seems to be the case.

@danielmitterdorfer
Member

@jasontedor Agreed. In that case the DNS cache settings will not help. As you explicitly mention that this is a deliberate choice, does it make sense to close this ticket then (and maybe document the decision or at least its consequences)?

@bleskes
Contributor

bleskes commented May 14, 2016

This is currently a deliberate choice.

I think it's OK to re-resolve the configured unicast host list when pinging. We don't do it often (only on master loss/initialization), and we also ping all IPs of the last known nodes on top of it.

My only concern is that a DNS resolution timeout or failure should not block or delay the pinging (remember we do it on master loss, and we block writes until pinging is done). This means implementing this can be tricky code-wise (that code is already hairy).
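
To sketch the concern (this is not the actual Elasticsearch implementation, just an illustration of bounding the lookup): re-resolution could be pushed onto a separate thread with a hard deadline, so a slow or dead DNS server cannot stall the ping round.

import java.net.InetAddress;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedResolver {
    private final ExecutorService executor = Executors.newCachedThreadPool();

    // Re-resolve a configured unicast host with a hard upper bound on how
    // long the caller will wait for DNS.
    InetAddress[] resolveWithin(String hostname, long timeoutMillis) {
        Future<InetAddress[]> lookup = executor.submit(() -> InetAddress.getAllByName(hostname));
        try {
            return lookup.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            lookup.cancel(true); // best effort; the lookup thread may linger
            return new InetAddress[0]; // caller falls back to last known addresses
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new InetAddress[0];
        }
    }
}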

@jasontedor
Member

I think it's OK to re-resolve the configured unicast host list when pinging.

I do too; I'm only explaining why the DNS cache settings here did and do nothing.

@danielmitterdorfer
Member

So it sounds to me like we should remove the "Discuss" label and add "AdoptMe" instead.

@clintongormley clintongormley added help wanted adoptme and removed discuss labels May 18, 2016
@thxmasj

thxmasj commented Oct 7, 2016

Any good workarounds for this? I have a similar problem running on Docker in swarm mode, where the master/gossip nodes are running as a service and the data nodes point to the service name. As Docker uses DNS for discovery, this is a problem there as well.

@Nils-Magnus

@thxmasj I provide the full list (as reported by Docker) explicitly. That mitigates, but does not resolve, the problem.

@jasontedor
Member

This is now addressed in the forthcoming 5.1.0 (no date, sorry). If you are in an environment where DNS changes are a thing, you will have to adjust the JVM DNS cache settings in your system security policy. Please consult the Zen discovery docs for details.

@dustinschultz

Has anyone tried enabling client.transport.sniff=true as a workaround? Curious if this would work around the issue.
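
For anyone trying this, here is a sketch of what enabling sniffing looks like with the 6.x transport client API (the cluster name and host are placeholders). Sniffing makes the client discover the nodes currently in the cluster state, which can pick up new addresses as long as at least one already-connected node is still reachable; it does not re-resolve the original hostnames.

import java.net.InetAddress;

import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class SniffingClientDemo {
    public static void main(String[] args) throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "my-cluster")   // placeholder cluster name
                .put("client.transport.sniff", true) // periodically fetch the current node list
                .build();
        try (PreBuiltTransportClient client = new PreBuiltTransportClient(settings)) {
            client.addTransportAddress(
                    new TransportAddress(InetAddress.getByName("es-host.example.com"), 9300)); // placeholder seed
            // ... use the client ...
        }
    }
}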

@devulapalli8

devulapalli8 commented Mar 15, 2017

Is there any update on this issue? What's the fix?

@jasontedor
Member

jasontedor commented Mar 15, 2017

Is there any update on this issue? What's the fix?

Yes, it's addressed starting in Elasticsearch 5.1.1. You can read about this in the zen discovery docs.

@devulapalli8

Thanks for the quick reply. I'm using ES 2.1; when one of the ES instances is rebuilt, it throws NoRouteToHostException. To resolve the issue, I'm forced to rebuild the dependent applications that connect to ES as clients. client.transport.sniff=true will enable discovery of new nodes added to the ES cluster. Is there a workaround for ES 2.1?

@devulapalli8

As mentioned in the documentation, I set networkaddress.cache.ttl=0 in java.security, but it didn't resolve the issue. When I ping the new ES instance it resolves to the new IP address as expected, so I don't think it's a DNS caching issue.

@jasontedor
Member

It's not addressed in the 2.x series; there is nothing you can do there. It's only resolved since 5.1.1.

@devulapalli8

devulapalli8 commented Mar 15, 2017

Thank you for the clarification. Could you please share your input on the following?

Below are my observations with an ES cluster of 3 nodes/instances.

We have a client application (REST services) that establishes ES client connections to the 3 cluster nodes during server startup. I see connections ESTABLISHED to those 3 ES node IP addresses with netstat -an | grep 9300.

I rebuilt ES node one and waited until the new node came up with a new IP and rejoined the cluster with green health, then rebuilt the other 2 nodes the same way, one after another. The 3 nodes now have 3 new IP addresses.

With client.transport.sniff=false:
NoRouteToHostException is logged continuously. When the last node is rebuilt I see the error "org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: [{#transport#-1}{", and netstat -an | grep 9300 does NOT show connections to the new ES IP addresses; it still shows the old IP connections. My REST service application cannot perform CRUD operations on the new ES nodes. To fix this I need to restart my application to get new connections to the new ES nodes.

With client.transport.sniff=true:
NoRouteToHostException is still logged continuously, but I did not see org.elasticsearch.client.transport.NoNodeAvailableException. I am able to perform CRUD operations from my client application, and netstat -an | grep 9300 shows connections to the new ES IP addresses. In this case NoRouteToHostException is just a WARN and has no impact on CRUD operations on ES, but the logs keep growing because of the exception.

Please share your thoughts on this.

@jasontedor
Member

Please open a topic on the forum. We prefer to use the forums for general discussions, and reserve GitHub for verified bug reports and feature requests.

@devulapalli8

devulapalli8 commented Mar 15, 2017

Thanks, it's done.

@nickhristov

Seeing the same issue with Spring Boot 2:

[INFO] +- org.springframework.boot:spring-boot-starter-data-elasticsearch:jar:2.1.0.RELEASE:compile
[INFO] |  \- org.springframework.data:spring-data-elasticsearch:jar:3.1.2.RELEASE:compile
[INFO] |     +- joda-time:joda-time:jar:2.10.1:compile
[INFO] |     +- org.elasticsearch.client:transport:jar:6.4.2:compile
[INFO] |     |  +- org.elasticsearch:elasticsearch:jar:6.4.2:compile
[INFO] |     |  |  +- org.elasticsearch:elasticsearch-core:jar:6.4.2:compile
[INFO] |     |  |  +- org.elasticsearch:elasticsearch-secure-sm:jar:6.4.2:compile
[INFO] |     |  |  +- org.elasticsearch:elasticsearch-x-content:jar:6.4.2:compile

@nickhristov

Looks like a Spring bug, not yours. Sorry.
