Software Service traffic decreased

sylvian_lam1
Organizer

Hi Everyone,

I defined a software service that has been running for months, but its traffic disappeared yesterday morning. This is very strange, so I'm posting here for help.

The target traffic is on subnet 10.68.134.x, ports 1494 and 2598.

When I compared the whole-day traffic of 26-May and 27-May, I found that the traffic on 27-May had decreased significantly. It seems the traffic decreased gradually until there was none at all.

The AMD uses a 10 Gb SFP+ port with the "customized driver". I used "tcpdump" (both at OS level and via rcon) to capture packets, but found no traffic on port 1494 or 2598, only broadcast traffic. It seems the target traffic is not reaching the mirror port. I logged a call with the Helpdesk, and they stopped investigating because there is no target traffic on the mirror port.

What changed on 26-May: multiple VLANs were consolidated onto a single SFP+ port.

Type : Local Session
Source VLANs :
Both : 401,430,434,603
Destination Ports : Te1/4
Encapsulation : Native
Ingress : Disabled
Learning : Disabled
Filter Pkt Type :
RX Only : Good

Has anyone seen this problem before? Is it a hardware issue?

Any help is much appreciated.

Sylvian

 

14 REPLIES

ulf_thorn222
Inactive

So to flip that around - what VLANs do you see in the RUM Console?

Couldn't it simply be that the packets have been moved from those VLANs to other VLANs, and therefore you only see broadcast and multicast traffic?

Check what VLANs you have at the source port/host.

sylvian_lam1
Organizer

Hi Ulf,

Thanks for your response. In the RUM Console, I can only see broadcast traffic from 10.68.134.x, but I do get other traffic from 10.68.130.x. My target traffic is on eth3. I wonder how to prove whether it is a hardware issue.

ulf_thorn222
Inactive

Hi

Not a hardware issue. A configuration issue in the switch. I can't think of anything that would make this a hardware issue, but then again, there has to be a first time for everything.

That's again why I like to go back to my old proverb "never trust SPAN".

Something in the VLAN setup has changed and the SPAN needs to be reconfigured to reflect this. If you had had a TAP in place, none of this would have happened, and you would also have seen packet errors and other things that don't go through a SPAN session.

sylvian_lam1
Organizer

Thanks Ulf. Unfortunately SPAN is the only option here.

I can't verify any change in the VLAN configuration, but hundreds of users are using it without issue. I wonder how the SPAN configuration could affect the output. I double-checked with the network team and they said no filter has been configured on the SPAN port, so I see no way out.

By the way, do you have an updated list of 10 Gb (SFP+) network cards that support the "customized driver"?

Much appreciated.

Sylvian

Hi

I'm not aware of any updated list of NICs other than the one online: Tested Cards

OK - the network guys are hard to convince?

Ask them what VLAN your source servers are in, and ask them to tell you about ALL the VLANs they are in, not just the ones they think you want to know about. Then check that against your output on the SPAN port. Also ask them to show how the SPAN is set up.

Try to see if the person who originally set up the SPAN can verify what it does and that it still works.
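
The comparison suggested above (server VLANs versus what the SPAN session mirrors) can be scripted once the network team hands over both lists. This is only a minimal sketch, assuming the SPAN side is pasted in the same text layout as the session output quoted earlier in this thread. The function names are hypothetical, and the server-side VLAN list (including VLAN 610) is invented purely for illustration.

```python
import re


def span_source_vlans(session_text):
    """Collect VLAN IDs from the direction lines that follow 'Source VLANs'."""
    vlans = set()
    in_vlan_block = False
    for raw in session_text.splitlines():
        line = raw.strip()
        if line.startswith("Source VLANs"):
            in_vlan_block = True
            continue
        if not in_vlan_block:
            continue
        match = re.match(r"(Both|RX Only|TX Only)\s*:\s*([\d,\-\s]+)$", line)
        if not match:
            in_vlan_block = False  # e.g. 'Destination Ports' ends the VLAN block
            continue
        for part in match.group(2).split(","):
            part = part.strip()
            if not part:
                continue
            if "-" in part:  # a range such as 401-403
                low, high = part.split("-")
                vlans.update(range(int(low), int(high) + 1))
            else:
                vlans.add(int(part))
    return vlans


def missing_from_span(server_vlans, session_text):
    """Return the server VLANs that the SPAN session does not mirror."""
    return sorted(set(server_vlans) - span_source_vlans(session_text))


SESSION = """\
Type : Local Session
Source VLANs :
Both : 401,430,434,603
Destination Ports : Te1/4
Encapsulation : Native
Filter Pkt Type :
RX Only : Good
"""

# Hypothetical list from the network team; VLAN 610 is made up for the example.
print(missing_from_span([401, 434, 610], SESSION))
```

Any VLAN printed by `missing_from_span` is one the servers use but the SPAN session does not list, which is the mismatch worth raising with the network team.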

sylvian_lam1
Organizer

Hi Everyone,

We've changed the SPAN port, and this time we can see some XenApp7 traffic (10.68.134.0) and XenApp6 traffic (10.68.130.0), but in both cases the number of servers is lower than expected.

So I'd like to go back to basics and verify some points here. Any ideas and comments are welcome, as I can't think of any other way out.

Environment / Configuration:
1) The environment consists of multiple VMs (ESXi) running on different hypervisors, connected to Cisco 49xx switches
2) Traffic is spanned from one of the Cisco 49xx switches (the whole VLANs are spanned), configured as below
           Type : Local Session
           Source VLANs :
           Both : 401,430,434,603
           Destination Ports : Te1/4
           Encapsulation : Native
           Ingress : Disabled
           Learning : Disabled
           Filter Pkt Type :
           RX Only : Good
3) The SPAN port is a 10 Gb SFP+ port (peak port utilization is below 2.6 Gbps, average below 1 Gbps)
4) The AMD customized driver is used; the sniffing card came with the Cisco UCS
           Ethernet controller: Broadcom Corporation NetXtreme II BCM57711 10-Gigabit PCIe
           Ethernet controller: Broadcom Corporation NetXtreme II BCM57711 10-Gigabit PCIe

Symptoms:
1) Packet overruns were found on the sniffing card
2) The number of XenApp6 and XenApp7 servers seen is lower than expected. For XenApp6, only 23 out of 150 servers are seen.

My questions:
1) In this environment and configuration, should it be possible to capture all server traffic within the VLANs?
2) Is there any best practice or recommended deployment or configuration for capturing server traffic in such a server environment? I thought of a virtual AMD or Gigamon-VM, but the VM admin rejected these because of the load on the virtual switch and hypervisor.
3) Has anyone dealt with a packet overrun issue before?

Much appreciated.

Sylvian

ulf_thorn222
Inactive

Hi

The information is too fragmented to draw conclusions from. How many Cisco 49xx switches do you have?

How many ESX hosts?

Do you have a VLAN plan?

What is "lower than expected" and what are the expected numbers based on (volume/users/sessions/bandwidth)?

I believe the packet overrun can be related to your buffers - how much memory does the AMD have, and can you tweak it?
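
Before tuning buffers, it can help to put a number on the overrun. The following is a minimal sketch, assuming a Linux-based AMD where `/proc/net/dev` is readable and the sniffing interface is `eth3` as mentioned earlier in the thread. The function name and the sample counters are invented. In that file the receive columns are bytes, packets, errs, drop, fifo and so on, and the `fifo` column is, as far as I know, what tools such as ifconfig report as overruns.

```python
def rx_counters(proc_net_dev_text, interface):
    """Return receive packets/errs/drop/fifo for one interface, or None if absent."""
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, stats = line.split(":", 1)
        if name.strip() != interface:
            continue
        fields = stats.split()
        # Receive columns: bytes packets errs drop fifo frame compressed multicast
        return {
            "packets": int(fields[1]),
            "errs": int(fields[2]),
            "drop": int(fields[3]),
            "fifo": int(fields[4]),
        }
    return None


# Invented sample in the /proc/net/dev layout, for illustration only.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth3: 987654321 1000000    0  120 4500     0          0       10        0       0    0    0    0     0       0          0
"""

counters = rx_counters(SAMPLE, "eth3")
overrun_ratio = counters["fifo"] / counters["packets"]
print(counters, f"{overrun_ratio:.2%}")
```

On a real system the text would come from `open("/proc/net/dev").read()`. Sampling it twice a few minutes apart and comparing the counters shows whether the overruns are still growing or are left over from an earlier burst.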

sylvian_lam1
Organizer

Hi Ulf,

Thanks for your response.

There are more than 14 Cisco 49xx switches, around 50 hypervisors and over 300 VM servers. The software service I monitor is expected to show 150 servers, but now shows only 20 or fewer.

Two findings (or suspicions):

1) The packet overrun may be caused by duplicate packets. However, duplicate packets are common on a SPAN port, so reducing them seems difficult.

2) The servers that DCRUM misses may be explained by inter-VM traffic, i.e. traffic that never leaves for the physical switch and stays within the virtual switch.
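
Suspicion 1 can be checked offline on a short capture. The sketch below is only an illustration, assuming a classic pcap file (not pcapng) small enough to load into memory. It counts packets that are byte-identical to one of the previous few packets, so copies that differ in any byte (for example after routing) are not counted. The function names, the window size and the synthetic sample are all assumptions made for this example.

```python
import hashlib
import struct


def read_pcap_packets(data):
    """Yield raw packet bytes from a classic pcap byte string (pcapng is not handled)."""
    magic = data[:4]
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"  # little-endian file (microsecond or nanosecond variant)
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"  # big-endian file
    else:
        raise ValueError("not a classic pcap file")
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        _sec, _frac, incl_len, _orig_len = struct.unpack_from(endian + "IIII", data, offset)
        offset += 16
        yield data[offset:offset + incl_len]
        offset += incl_len


def duplicate_ratio(data, window=64):
    """Fraction of packets whose bytes match one of the previous `window` packets."""
    recent = []  # digests of the most recent packets, oldest first
    total = duplicates = 0
    for packet in read_pcap_packets(data):
        digest = hashlib.sha1(packet).digest()
        total += 1
        if digest in recent:
            duplicates += 1
        recent.append(digest)
        if len(recent) > window:
            recent.pop(0)
    return duplicates / total if total else 0.0


def _record(payload):
    """Build one pcap record (zero timestamp) around a fake payload."""
    return struct.pack("<IIII", 0, 0, len(payload), len(payload)) + payload


# Synthetic little-endian pcap: every frame appears twice.
header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
sample = header + b"".join(_record(p) for p in (b"frame-A", b"frame-A", b"frame-B", b"frame-B"))
print(duplicate_ratio(sample))
```

For a real capture the bytes would come from something like `open("/tmp/capture.pcap", "rb").read()` (a hypothetical path). A ratio close to 0.5 would suggest that most packets are mirrored twice.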

Do you think the above makes sense? Any ideas?

Sylvian

 

Hi

OK - so you have a sizeable installation with a few items to keep track of (big grin).

I always try to start from the outside, working in. This means that for sure, I want to capture traffic as it leaves the DC. Then I know I will have the users and can keep an eye on the number and the response time. Then I work my way in through the infrastructure as far as is needed or technically feasible.

Without a drawing or a better understanding of all the pieces in your puzzle, it's very hard to tell where you are or what you are listening to, but I sense that either the network folks or the VM admins have been "helping" you and forgot to tell you what they have done. If you have access to the traffic going to the users and this is still the same, I guess (just as you say) that the characteristics of the DC traffic have changed due to either intra-VM traffic or a change of VLANs.

Hello, I was drawn here by a similar issue: the NetEng claims (and states) that the intended VLANs are configured for our SPAN, yet we fail to see the VLAN or its traffic.

I was just reading the FAQ, and since Sylvian also mentions having 10G cards in use, I wonder whether the setting below limits us in reading all the data on the SPAN.
Doesn't it cap us at 2 of the possible 10G?

How to catch the traffic on the AMD over 2 GB?

Go to your AMD:

/usr/adlex/config/cba.config.xml and change the value

<maxSampleSize>2046</maxSampleSize>


Ref:

https://community.dynatrace.com/community/display/...

sylvian_lam1
Organizer

It seems I missed Ulf's reply because I didn't receive any email notification from the forum. Thanks for your response, Ulf.

Hi Frans,

I don't know if maxSampleSize can help. Sometimes I saw that interface utilization (from the AMD diagnostics) was over 3 GB, so it is hard to confirm whether "maxSampleSize" makes a difference.

Can you confirm whether the AMD can see other traffic, or whether only that VLAN's traffic is missing?

I'm using Red Hat 6.6. In my case I need to switch between the "native" and "customized" drivers to "wake up" the 10G interface card, then use "rtminst" to check whether traffic is coming in.

Sylvian

Hi,

I think I have to revise my post. Maybe I was a bit misled by the FAQ title. I think max sample is the size (2G) of the sample files.
So I doubt that not seeing ANY traffic from a certain VLAN is related to that.

Our AMD is seeing plenty of data, but none from one particular VLAN that has just been added. The network team claims, with proof, that the VLAN has been added to the SPAN. I seriously doubt that the data is actually reaching 'our' SPAN port, but I need to rule out that the AMD is in some way the cause.

However, I am under the impression that 9 times out of 10 it's the network-side configuration that is the cause.

In the RUM Console you can see which VLANs are recorded.
I am looking for a CLI (rcon?) command to retrieve this list.

We are now running with the custom driver.

sandrine-extern
Advisor

Hi,

Have you tried switching back to the native drivers?

Custom drivers automatically filter the traffic so that only the traffic under surveillance reaches the probe. They may block some traffic if there is an error in your Software Service definition.

Maybe you should try the native driver and then run a TCPDUMP (the OS one) to see if you capture all of the needed traffic. What do you think?

Regards,

Sandrine

sylvian_lam1
Organizer

Agreed with Sandrine.

Frans,

I used the following to verify the traffic, see if it helps in your case.

Procedure to check traffic (OS level)

switch to native driver

a) stop AMD by ndstop.
b) capture the trace in linux prompt with the command:
tcpdump -w "/tmp/test??.pcap" -i eth3 -vv

c) to verify:

tcpdump -nn -r test.pcap | grep -F "10.10.128" > temp_traffic 2>&1
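
To turn such a text dump into the number that matters in this thread (how many servers are actually visible), the lines can be summarised per server. This is a minimal sketch, assuming the capture is printed with numeric addresses and ports (tcpdump `-nn`, without `-e`). It uses the subnet and ports from the original post and treats 10.68.134.x as a /24, which is an assumption. The function name and the sample lines are invented for illustration.

```python
import ipaddress
import re

# Matches the "IP a.b.c.d.port > e.f.g.h.port:" part of a numeric tcpdump text line.
LINE_RE = re.compile(r"IP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):")


def servers_seen(lines, subnet="10.68.134.0/24", ports=(1494, 2598)):
    """Return the addresses in `subnet` that appear with one of `ports` on their side."""
    network = ipaddress.ip_network(subnet)
    seen = set()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # ARP, IPv6 and other lines without an IPv4 address/port pair
        src, sport, dst, dport = match.groups()
        for address, port in ((src, int(sport)), (dst, int(dport))):
            if port in ports and ipaddress.ip_address(address) in network:
                seen.add(address)
    return seen


# Invented sample lines in tcpdump's numeric text format.
SAMPLE = [
    "12:00:00.000001 IP 10.68.134.21.1494 > 10.20.30.40.51515: Flags [P.], length 120",
    "12:00:00.000002 IP 10.20.30.41.51999 > 10.68.134.22.2598: Flags [.], length 0",
    "12:00:00.000003 IP 10.68.134.21.138 > 10.68.134.255.138: UDP, length 201",
    "12:00:00.000004 ARP, Request who-has 10.68.134.1 tell 10.68.134.21, length 46",
]
found = servers_seen(SAMPLE)
print(len(found), sorted(found))
```

Passing `sys.stdin` instead of `SAMPLE` lets the function read the tcpdump output through a pipe. Comparing `len(found)` with the expected server count shows how much of the farm the SPAN actually delivers.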